Rough-n-Ready Roofline: NVIDIA V100 edition
In this post we discuss rules of thumb for performance limiters when using shared memory in an NVIDIA V100 CUDA compute kernel.
The V100 16GB PCI-E card has:
Theoretical device memory bandwidth of 900 GB/s; using cudaMemcpy we measure an achievable bandwidth of 790 GB/s (a measurement sketch follows this list).
Combined shared memory & L1 cache, whose throughput we guesstimate as: (SH + L1) bandwidth = 80 (SMs) × 32 (SIMD width) × 4 (bytes per word) × 1.245 GHz (base clock) ≈ 12.75 TB/s
Theoretical peak FP64 throughput of 7 TFLOP/s (link)
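As a rough check on the 790 GB/s figure, the kind of measurement we mean can be sketched as a timed device-to-device cudaMemcpy. This is a minimal sketch, not our exact benchmark; the buffer size and repeat count are illustrative choices:

```cpp
// Time a device-to-device cudaMemcpy with CUDA events and report
// the achieved bandwidth. Each copy both reads and writes `bytes`,
// so 2 * bytes cross the memory bus per repetition.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;  // 1 GiB per buffer (illustrative)
    const int reps = 20;              // illustrative repeat count

    void *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up copy so the timed loop sees steady-state behavior.
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double gbps = 2.0 * bytes * reps / (ms * 1e-3) / 1e9;
    printf("achieved bandwidth: %.1f GB/s\n", gbps);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```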
Putting these together we plot the following FP64 performance pyramid for the V100:
Interpreting this: each band tells us the maximum performance we can expect for a given range of combined shared + L1 memory accesses, where we have assumed the same memory bus serves both caches. Note: a kernel may drop below the band suggested by a naive estimate of its arithmetic intensity and local cache accesses if one or more additional performance limiters are present. For instance, use of special function units, extremely low occupancy caused by excess register or shared memory usage, or register spills to local memory (which is cached in L1) may all cost performance.
Example: if a kernel accesses 6 bytes of shared + L1 memory for every floating-point operation, it sits in the orange band and can achieve at most just over 2 TFLOP/s.
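The band arithmetic behind that example is just the roofline minimum of peak compute and what the shared + L1 bus can feed at a given bytes-per-flop ratio. A minimal sketch, using the estimates above:

```cpp
// Roofline bound: attainable FLOP/s is the lesser of peak compute
// and cache bandwidth divided by the kernel's bytes-per-flop ratio.
#include <algorithm>
#include <cstdio>

int main() {
    const double peak_tflops    = 7.0;   // V100 FP64 peak, TFLOP/s
    const double sh_l1_tbps     = 12.75; // guesstimated shared + L1, TB/s
    const double bytes_per_flop = 6.0;   // the kernel in the example

    double attainable = std::min(peak_tflops, sh_l1_tbps / bytes_per_flop);
    printf("attainable: %.2f TFLOP/s\n", attainable);  // ~2.12 TFLOP/s
    return 0;
}
```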
For comparison we include the performance pyramid for a Pascal P100 12GB PCI-E card from an earlier post:
Noticeable differences:
The gray zone of inaccessible performance is less pronounced in the V100 chart.
The P100 has separate shared memory and L1 cache, whereas the V100 unifies them into a single array (see the configuration sketch after this list).
The V100 has an extra performance band for kernels that can live with fewer than 2 bytes of shared memory access per floating-point operation.
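Because shared memory and L1 occupy one physical array on the V100, the CUDA runtime exposes a per-kernel hint for how to split it. A minimal sketch, assuming a placeholder kernel `my_kernel` and an illustrative 50% carveout:

```cpp
// Hint that roughly half of the V100's unified L1/shared array
// should be carved out as shared memory for this kernel.
#include <cuda_runtime.h>

__global__ void my_kernel() { /* placeholder kernel body */ }

int main() {
    // 50 is an illustrative percentage, not a recommendation.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         50);
    my_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```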
In future posts we will discuss our attempts to write compute kernels that can access the green band on the V100 GPU.