# Climbing the Pascal Performance Pyramid: 10 rules of thumb

Rules of thumb for optimizing FP64 CUDA kernels on the Pascal-class NVIDIA P100 GPU. (Numbers vary with the specific model; those below were measured on the 12GB PCI-Express version.)

**Rule #1 - device global memory bandwidth is limited**

The device memory bandwidth is listed at 549 GB/s, but cudaMemcpy typically achieves only ~400 GB/s for device-to-device copies. A kernel is very unlikely to exceed the performance of a device-to-device copy. The gray region in the above roofline plot shows how this bandwidth gap limits achievable performance.

Achieved global device memory throughput can be measured using NVIDIA's nvprof command line profiler (link):

nvprof --metrics dram_read_throughput,dram_write_throughput ./foo
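
As a cross-check on the profiler numbers, achieved bandwidth can also be estimated by hand from a timing. The sketch below is illustrative only: the helper name and the timing are hypothetical, and it assumes that both the read and the write of each value count as memory traffic.

```python
# Sketch: effective bandwidth of a device-to-device copy of FP64 values,
# compared with the ~400 GB/s cudaMemcpy figure quoted in the text.

MEMCPY_BW_GBS = 400.0  # achievable device-to-device bandwidth from the text


def effective_bandwidth_gbs(n_doubles, seconds):
    """Bandwidth in GB/s for copying n_doubles FP64 values in `seconds`.

    Assumption: each value is read once (8 B) and written once (8 B),
    and both directions count as traffic.
    """
    bytes_moved = 2 * 8 * n_doubles
    return bytes_moved / seconds / 1e9


# Hypothetical timing, for illustration only: 2**27 doubles copied in 5.4 ms.
bw = effective_bandwidth_gbs(2**27, 5.4e-3)
print(f"{bw:.1f} GB/s, {100 * bw / MEMCPY_BW_GBS:.0f}% of the memcpy rate")
```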

**Rule #2 - aim for the sloping roofline**

The sloped roofline at the bottom of the gray zone corresponds to the maximum FP64 performance attainable at a given arithmetic intensity. Ideally, a kernel with low arithmetic intensity sits on the sloping roofline, where data streams at the maximum achievable 400 GB/s.

We can measure the percentage of peak FP64 floating point performance achieved with nvprof:

nvprof --metrics flop_dp_efficiency ./foo
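
To make the sloping roofline concrete, the following sketch evaluates the bandwidth-bound performance for a given ratio of flops to bytes. The function name and the AXPY-style example are illustrative assumptions, not kernels from this post; the 400 GB/s figure is the achievable bandwidth from Rule #1.

```python
# Sketch: performance on the sloping (bandwidth-bound) part of the roofline.

STREAM_BW_GBS = 400.0  # achievable streaming bandwidth from Rule #1


def bandwidth_bound_gflops(flops, bytes_moved, bw_gbs=STREAM_BW_GBS):
    """FP64 GFLOP/s for a kernel that performs `flops` operations per
    `bytes_moved` bytes of global memory traffic, if it streams at bw_gbs."""
    intensity = flops / bytes_moved  # arithmetic intensity in FLOP/byte
    return intensity * bw_gbs


# Hypothetical example: y[i] = a * x[i] + y[i] in FP64 does 2 flops per
# entry and moves 24 bytes (two 8 B loads and one 8 B store).
axpy = bandwidth_bound_gflops(flops=2, bytes_moved=24)
print(f"AXPY-style update: about {axpy:.1f} GFLOP/s")
```

A kernel like this sits far to the left of the roofline, so reaching the sloping line (rather than the flat top) is the realistic target.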

**Rule #3 - the top roofline requires high arithmetic intensity**

The P100 has 56 Pascal-class streaming multiprocessors (SMP), each with two 32-wide SIMD ALUs that are effectively 16 lanes wide for FP64 (the width used in the estimate below), running between 1.126 and 1.3 GHz (specs). It is difficult to write kernels that achieve theoretical peak performance (the horizontal part of the roofline):

GF ~ #SMP x #SIMD x size(SIMD) x flops/cycle x base frequency ~ 56 x 2 x 16 x 2 flops x 1.126 GHz

~ 4.036 TFLOP/s

Combining with **Rule #1** we observe that the kernel must attain an arithmetic intensity of approximately

(4 TFLOP/s)/(0.4 TB/s) ~ 10 FLOP/byte

to achieve theoretical peak. This is reflected by the horizontal upper part of the roofline.
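
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch using only the figures quoted in this post; the variable and function names are illustrative, and counting two flops per cycle per lane is assumed to correspond to one fused multiply-add.

```python
# Sketch: theoretical FP64 peak, ridge-point arithmetic intensity, and the
# resulting roofline for the 12GB PCI-Express P100 figures quoted above.

N_SMP = 56               # streaming multiprocessors
SIMD_UNITS_PER_SMP = 2   # SIMD units per SMP
FP64_LANES = 16          # FP64 lanes per SIMD unit, as used in the estimate
FLOPS_PER_CYCLE = 2      # assumed: one fused multiply-add = 2 flops
BASE_GHZ = 1.126         # base clock
STREAM_BW_GBS = 400.0    # achievable bandwidth from Rule #1

peak_gflops = (N_SMP * SIMD_UNITS_PER_SMP * FP64_LANES
               * FLOPS_PER_CYCLE * BASE_GHZ)

# Arithmetic intensity at which the sloping roofline meets the flat top.
ridge_flop_per_byte = peak_gflops / STREAM_BW_GBS


def roofline_gflops(intensity):
    """Attainable GFLOP/s at a given arithmetic intensity (FLOP/byte)."""
    return min(peak_gflops, intensity * STREAM_BW_GBS)


print(f"peak  ~ {peak_gflops / 1000:.3f} TFLOP/s")
print(f"ridge ~ {ridge_flop_per_byte:.1f} FLOP/byte")
```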

**Rule #4 - shared memory bandwidth is limited**

Each SMP has its own shared memory space. We estimate the total shared memory bandwidth over the entire device assuming that the shared memory manager on each SMP can serve one 32-bit word to each of 32 SIMD lanes per clock cycle:

BW(shared) ~ #SMP x size(WORD) x size(SIMD) x base frequency ~ 56 x 4B x 32 x 1.126GHz

~ 8.071 TB/s

For convenience we will use 8TB/s in the following.
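
For reference, the same estimate in Python, together with the rate at which FP64 values can be delivered from shared memory. This is a sketch under the assumption stated above (one 4-byte word per lane per cycle); the names are illustrative.

```python
# Sketch: device-wide shared memory bandwidth estimate from the text.

N_SMP = 56        # streaming multiprocessors
WORD_BYTES = 4    # one 32-bit word per request
SIMD_LANES = 32   # lanes served per clock cycle on each SMP
BASE_GHZ = 1.126  # base clock

shared_bw_gbs = N_SMP * WORD_BYTES * SIMD_LANES * BASE_GHZ

# An FP64 value occupies two 32-bit words, so the rate of FP64 values
# delivered from shared memory is the byte rate divided by 8.
fp64_values_per_ns = shared_bw_gbs / 8

print(f"shared bandwidth ~ {shared_bw_gbs / 1000:.3f} TB/s")
print(f"FP64 values from shared ~ {fp64_values_per_ns:.0f} per ns")
```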

**Rule #5 - avoid overusing shared memory**

The regions shaded purple/red/orange/green in the above roofline plot represent accessible performance where each color represents a different maximum number of accesses to shared memory variables per floating point operation.

For instance, the purple region represents the maximum performance obtainable when at least two double precision variables are accessed from shared memory per floating point operation.

The red region represents the maximum performance that can be achieved when less than two, but at least one FP64 shared memory variable is accessed for each floating point operation.

Likewise, the orange region corresponds to less than one but at least one half FP64 shared memory access on average per flop.

Finally, the green region corresponds to less than one half FP64 shared memory access per flop.
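
The zone boundaries can be estimated from the 8 TB/s shared memory bandwidth: each FP64 access moves 8 bytes, so the ceiling is the bandwidth divided by the bytes of shared traffic per flop. The sketch below is an inference from the numbers in this post rather than a reading of the plot itself, and the function name is illustrative.

```python
# Sketch: performance ceiling imposed by shared memory traffic.

SHARED_BW_GBS = 8000.0  # rounded shared memory bandwidth from Rule #4
PEAK_GFLOPS = 4035.584  # theoretical FP64 peak from Rule #3
FP64_BYTES = 8


def shared_ceiling_gflops(accesses_per_flop):
    """Upper bound on GFLOP/s when a kernel makes `accesses_per_flop`
    FP64 shared memory accesses per floating point operation."""
    if accesses_per_flop <= 0:
        return PEAK_GFLOPS  # no shared traffic: only the flop peak applies
    return min(PEAK_GFLOPS, SHARED_BW_GBS / (FP64_BYTES * accesses_per_flop))


# Ceilings at the zone boundaries described above.
for accesses in (2.0, 1.0, 0.5):
    print(f"{accesses} accesses/flop -> {shared_ceiling_gflops(accesses):.0f} GFLOP/s")
```

Under these assumptions, a kernel in the purple zone is capped at roughly 0.5 TFLOP/s, the red zone tops out near 1 TFLOP/s, the orange zone near 2 TFLOP/s, and only the green zone leaves room to approach peak.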

**Rule #6 - block loops carefully to minimize shared memory accesses**

When we tune CUDA kernels for Pascal we typically aim to reduce the number of shared memory accesses: we use shared memory to prefetch data, copy values from shared memory into register variables, and then reuse those register values repeatedly. This usually amounts to blocking loops carefully.
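
A simple counting model shows why register blocking helps. The sketch below assumes a matrix-product style inner loop in which each thread keeps a small tile of accumulators in registers; this model and its names are illustrative assumptions, not a description of the kernels discussed in this post.

```python
# Sketch: shared memory accesses per flop for a register-blocked inner loop.


def shared_accesses_per_flop(tile_rows, tile_cols):
    """Assumed model: per inner-loop step a thread copies tile_rows values
    of one operand and tile_cols values of the other from shared memory
    into registers, then performs tile_rows * tile_cols fused multiply-adds
    (2 flops each) using only registers."""
    loads = tile_rows + tile_cols
    flops = 2 * tile_rows * tile_cols
    return loads / flops


for rows, cols in ((1, 1), (2, 2), (4, 4)):
    ratio = shared_accesses_per_flop(rows, cols)
    print(f"{rows}x{cols} register tile: {ratio:.3f} shared accesses per flop")
```

In this model an unblocked loop makes one shared access per flop, while a 4x4 register tile cuts that to a quarter, at the cost of more registers per thread.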

**Rule #7 - tune**

We time each kernel over a range of problem blocking parameters to find the optimal choice. We will describe this process in a later blog entry.
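
Pending that description, a generic parameter sweep might look like the sketch below. It is not the process used for the kernels in this post: the helper names are hypothetical, and the stand-in cost model exists only so the example runs without a GPU.

```python
# Sketch: pick the fastest setting from a list of candidate parameters.
import time


def time_once(run):
    """Wall-clock seconds for one call of `run`. A real GPU runner would
    need to synchronize the device before returning."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def tune(candidates, make_runner, measure=time_once, repeats=3):
    """Return (best_params, best_seconds) over the candidate parameters,
    taking the minimum of `repeats` measurements for each candidate."""
    best_params, best_seconds = None, float("inf")
    for params in candidates:
        run = make_runner(params)
        seconds = min(measure(run) for _ in range(repeats))
        if seconds < best_seconds:
            best_params, best_seconds = params, seconds
    return best_params, best_seconds


# Stand-in cost model for illustration: the runner returns a synthetic
# "time" that is smallest for a block size of 4.
def fake_runner(block):
    return lambda: abs(block - 4) + 1.0


best, seconds = tune([1, 2, 4, 8], fake_runner, measure=lambda run: run())
print(f"best block size: {best}")
```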

**Rule #8 - be parsimonious with the 24KB L1 SMP cache**

We usually find that it is difficult to hit the 4 TFLOP/s theoretical peak performance of the P100 (the flat part of the top roofline). Although we are careful to use a combination of L2, L1, and shared memory caches, we find that for our typical high-order finite element kernels the small 24KB allocated to each SMP L1 cache is a serious performance limiter.

**Rule #9 - exploit data repetition**

This is rather specific to matrix-matrix multiplication. It is sometimes possible to exploit matrix structure when performing matrix-vector or matrix-matrix products to reduce the number of shared memory accesses per flop.

**Rule #10 - be realistic about attainable performance**

If a kernel reaches the green zone we are in reasonable shape. At that point we typically have to go back to the FEM formulation to get further performance improvement, for instance by introducing extra structure or symmetry to the element operator matrices with a smarter choice of basis.

**Resources**: preprint on roofline guided optimization of high-order hex element operation kernels here. Vasily Volkov's paper on tuning matrix-matrix multiplication here.