Taking Notes

LOW DOWN ON HIGH-ORDER BLOG

 

Rough-n-ready Roofline: Titan V edition

In this post we discuss rules of thumb for performance limiters when using shared memory in a CUDA compute kernel running on a Titan V, which is coincidentally the topic of today's lecture in my VT course on advanced GPU+FEM topics. According to the Volta microarchitecture wiki entry, the Titan V card has the following characteristics:

  • Theoretical device memory bandwidth of 652 GB/s. Using cudaMemcpy (from OCCA) we measure a typical achievable memory bandwidth of 540 GB/s [note that for some magic array lengths we see 570 GB/s, but we discount that here].
  • Combined shared memory & L1 cache, for which we guesstimate the throughput: (SH + L1) GB/s = 80 (cores) x 32 (simd width) x 4 (word bytes) x 1.2 (base c
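The arithmetic behind these figures can be checked with a minimal sketch like the one below. It uses only the numbers quoted above; the function and constant names are hypothetical, and since the excerpt is cut off after the 1.2 factor, treating that factor as a clock rate in GHz (and assuming no further factors follow) is an assumption, not something the text confirms.

```python
# Back-of-envelope check of the Titan V figures quoted in the post.
# All names here are illustrative; only the numeric values come from the text.

THEORETICAL_BW_GBS = 652.0  # theoretical device memory bandwidth (GB/s)
MEASURED_BW_GBS = 540.0     # typical cudaMemcpy measurement (GB/s)

CORES = 80          # "cores" factor in the (SH + L1) guesstimate
SIMD_WIDTH = 32     # "simd width" factor
WORD_BYTES = 4      # "word bytes" factor
CLOCK_FACTOR = 1.2  # ASSUMPTION: read as GHz; the source is truncated here


def bandwidth_efficiency(measured, theoretical):
    """Fraction of the theoretical bandwidth that was actually measured."""
    return measured / theoretical


def visible_factor_product(cores, simd_width, word_bytes, clock_factor):
    """Product of the four factors visible in the truncated formula."""
    return cores * simd_width * word_bytes * clock_factor


if __name__ == "__main__":
    eff = bandwidth_efficiency(MEASURED_BW_GBS, THEORETICAL_BW_GBS)
    print(f"achieved / theoretical device bandwidth: {eff:.1%}")

    product = visible_factor_product(CORES, SIMD_WIDTH, WORD_BYTES, CLOCK_FACTOR)
    print(f"product of visible (SH + L1) factors: {product:.0f}")
```

Under these assumptions the measured cudaMemcpy bandwidth is a little over four fifths of the theoretical peak, and the visible factors multiply out to roughly 12288. Whether that product is the final (SH + L1) estimate depends on the part of the formula that is cut off.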

 

225 Stanger St
Blacksburg, VA 24061
USA.


©2018 by the Parallel Numerical Algorithms Research Group @VT.