Rough-n-ready Roofline: Titan V edition
In this post we discuss rules of thumb for identifying performance limiters when using shared memory in a CUDA compute kernel running on an NVIDIA Titan V - coincidentally the topic of today's lecture in my advanced GPU+FEM topics course at VT.
According to the Volta micro-architecture wiki entry, the Titan V card has the following characteristics:
Theoretical device memory bandwidth of 652 GB/s.
Using cudaMemcpy (invoked through OCCA) we measure a typical achievable memory bandwidth of 540 GB/s (see the timing sketch after this list) [ note that for some magic array lengths we see 570 GB/s, but we discount that here ].
Combined shared memory & L1 cache, whose throughput we guesstimate as: (SH + L1) GB/s = 80 (SMs) x 32 (SIMD width) x 4 (bytes per word) x 1.2 (GHz base clock) ~= 12.288 TB/s, i.e. 128 bytes per SM per clock.
Theoretical peak FP64 throughput of 6.9 TFLOPS (link)
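For reference, here is a minimal sketch of how one might time device-to-device copies to estimate the achievable bandwidth quoted above. This is an assumed reconstruction (the OCCA harness is not shown), and the array length and repetition count are arbitrary choices:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // Arbitrary array length: ~0.5 GB of doubles (not one of the "magic" lengths).
  const size_t N = 1 << 26;
  const size_t bytes = N * sizeof(double);

  double *d_a, *d_b;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);

  // Warm up once so setup costs don't pollute the timing.
  cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  const int reps = 100;
  cudaEventRecord(start);
  for (int r = 0; r < reps; ++r)
    cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms;
  cudaEventElapsedTime(&ms, start, stop);

  // Each device-to-device copy reads and writes every byte, hence the factor of 2.
  double gbps = 2.0 * bytes * reps / (ms / 1e3) / 1e9;
  printf("achieved device memory bandwidth: %.0f GB/s\n", gbps);

  cudaFree(d_a);
  cudaFree(d_b);
  return 0;
}
```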
Putting these together we plot the following FP64 performance pyramid for the NVIDIA Titan V:
Interpreting this: each band tells us the maximum performance we can expect for a given range of combined shared + L1 memory traffic per flop, since the L1 cache and shared memory are advertised as unified on the Volta SMs.
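Concretely, each band ceiling follows from a min(peak, bandwidth / intensity) roofline. A minimal host-side sketch, with the numbers above hard-coded (the loop bounds are my own choice), reproduces the band ceilings:

```cpp
#include <cstdio>

int main() {
  const double peakTflops = 6.9;    // theoretical FP64 peak
  const double shL1TBps   = 12.288; // guesstimated shared + L1 throughput

  // Attainable FP64 rate when each flop requires 'bytesPerFlop' bytes
  // from the unified shared + L1 cache: min(peak, bandwidth / intensity).
  for (double bytesPerFlop = 8.0; bytesPerFlop >= 1.0; bytesPerFlop /= 2.0) {
    double tflops = shL1TBps / bytesPerFlop;
    if (tflops > peakTflops) tflops = peakTflops;
    printf("%4.1f B/flop -> at most %.2f TFLOPS (%.0f%% of peak)\n",
           bytesPerFlop, tflops, 100.0 * tflops / peakTflops);
  }
  return 0;
}
```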
Note that the gray area is performance that is inaccessible due to the discrepancy between theoretical (i.e. advertised) memory bandwidth and empirically measured DEVICE memory bandwidth. This is a little pessimistic, since we have discounted the higher bandwidth achievable when memcpying arrays with magic lengths.
Thus to get close to 50% of peak FP64 throughput we would need to read fewer than 4 bytes from the unified shared + L1 cache per floating point operation.
To get close to peak (the gold band) we need fewer than 2 bytes read from the unified cache per floating point operation. Oh, and all the flops would need to be fused multiply-adds!
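For completeness, a hypothetical kernel fragment showing the FMA pattern: the device-side fma() call maps to a single double-precision FMA instruction, which retires two flops (a multiply and an add) and is what the 6.9 TFLOPS peak assumes:

```cpp
// Hypothetical axpy-like kernel: each thread performs one double-precision
// fused multiply-add, counted as two flops toward the FP64 peak.
__global__ void axpy(int n, double alpha, const double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = fma(alpha, x[i], y[i]); // single DFMA instruction
}
```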