In this post we discuss rules of thumb for performance limiters when using shared memory in a CUDA compute kernel running on a Titan V - coincidentally the topic of today's lecture in my VT course on advanced GPU and FEM topics.
Putting these together we plot the following FP64 performance pyramid for the NVIDIA Titan V:
Interpreting this: each band tells us the maximum performance we can expect for a given range of combined shared + L1 memory accesses, since the L1 cache and shared memory are advertised as unified on the Volta streaming multiprocessors.
Note that the gray area is the performance made inaccessible by the discrepancy between theoretical (i.e. advertised) memory bandwidth and empirically measured DEVICE memory bandwidth. This is a little pessimistic, since we have discounted the higher memory bandwidth achievable when memcopying with magic array lengths.
Thus, to get close to 50% of peak FP64 throughput, we would need fewer than 4 bytes to be read from the unified shared+L1 cache per floating point operation. To get close to peak (the gold band), we need fewer than 2 bytes read from the unified cache per floating point operation. Oh, and all the flops would need to be fused multiply-adds!
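These bands follow from a simple roofline-style model: attainable throughput is the minimum of the FP64 peak and the unified shared+L1 bandwidth divided by the bytes moved per flop. A minimal sketch, with the caveat that the peak figure below is an assumed back-of-envelope number (5120 FP32 cores at roughly 1455 MHz boost, 2 flops per FMA, FP64 at half rate) and the unified bandwidth is simply backed out of the 2 bytes/flop threshold stated above rather than measured:

```python
# Roofline-style model for the shared+L1 performance pyramid.
# PEAK_FP64 is an assumed Titan V figure (all flops issued as FMAs);
# B_UNIFIED is implied by the post's "2 bytes/flop for peak" threshold.

PEAK_FP64 = 7.45e12            # FLOP/s, assumed peak with all-FMA workload
B_UNIFIED = 2.0 * PEAK_FP64    # bytes/s implied by the 2 B/flop threshold

def attainable(bytes_per_flop: float) -> float:
    """Max FP64 throughput (FLOP/s) for a given shared+L1 traffic ratio."""
    if bytes_per_flop <= 0.0:
        return PEAK_FP64       # compute bound: no memory traffic
    return min(PEAK_FP64, B_UNIFIED / bytes_per_flop)
```

Evaluating the model at the two thresholds reproduces the bands: `attainable(2.0)` sits at peak, while `attainable(4.0)` lands at exactly half of peak, and everything beyond that slides further down the pyramid.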