Putting these together, we plot the following FP64 performance pyramid for the V100:
To interpret the chart: each band gives the maximum performance we can expect for a given range of combined shared + L1 memory accesses per floating point operation, where we have assumed that both caches are served by the same memory bus.
Note: a kernel may drop below the performance band suggested by a naive estimate of arithmetic intensity and local cache accesses if one or more additional performance limiters are present. For instance, use of the special function units, extremely low occupancy caused by excessive register or shared memory usage, or register spilling to local memory (which is cached in L1) may all reduce performance.
Example: if a kernel accesses 6 bytes of shared + L1 memory for every floating point operation, it falls in the orange band and achieves at most just over 2 TFLOP/s.
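The bound in this example can be sketched as a simple roofline-style calculation: attainable performance is the smaller of the peak FP64 rate and the combined shared + L1 bandwidth divided by the bytes accessed per floating point operation. The snippet below is only an illustration. The function name `max_flops` and both default values (7 TFLOP/s peak FP64 and 13 TB/s combined shared + L1 bandwidth) are assumptions chosen for illustration, not figures taken from this post; substitute your own measured numbers.

```python
def max_flops(bytes_per_flop, peak_flops=7.0e12, cache_bw=13.0e12):
    """Upper bound on FLOP/s for a kernel limited by shared + L1 traffic.

    bytes_per_flop: combined shared + L1 bytes accessed per floating point op.
    peak_flops:     ASSUMED peak FP64 rate in FLOP/s (illustrative value).
    cache_bw:       ASSUMED combined shared + L1 bandwidth in bytes/s
                    (illustrative value, not a measurement from this post).
    """
    if bytes_per_flop <= 0:
        # No cache traffic at all: only the compute peak limits the kernel.
        return peak_flops
    # The kernel cannot exceed either the compute peak or the rate at which
    # the cache can feed its operands.
    return min(peak_flops, cache_bw / bytes_per_flop)


if __name__ == "__main__":
    bound = max_flops(6.0)
    print(f"{bound / 1e12:.2f} TFLOP/s")  # prints "2.17 TFLOP/s" with the assumed defaults
```

With these assumed values, a kernel that needs 6 bytes per operation is bandwidth bound, while a kernel that needs 1 byte per operation is capped by the compute peak instead.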
For comparison we include the performance pyramid for a Pascal P100 12GB PCI-E card from an earlier post:
The gray zone of inaccessible performance is less pronounced in the V100 chart.
The P100 has separate shared and L1 caches whereas the V100 has a unified L1 and shared cache.
The V100 has an extra performance band for kernels that can get by with fewer than 2 bytes of shared memory access per floating point operation.
In future posts we will discuss our attempts to write compute kernels that can access the green band on the V100 GPU.