When I started writing GPU code, I often heard that using shared memory is the only way to get good performance. As I dove deeper into CUDA programming and performance analysis, I came to understand that obtaining good performance on a GPU is a far more involved task.
In this post I discuss basic optimization strategies for GPUs.
Understanding memory and its hierarchy: In terms of speed, from fastest to slowest, GPU memory can be organized as follows: registers, shared memory, the constant and texture caches, and finally global and local memory.
Short characteristics of each memory type:
Global memory is typically large (from 1 GB to 16 GB). It allows read and write access and is allocated by the host on the device. All threads can access global memory.
Local memory is an abstraction of global memory. On newer cards, local memory accesses are cached in L1. Like global memory, it allows read and write access. This type of memory, however, is managed by the compiler. Each thread can access only its own local memory.
Texture memory is a read-only cache for global device memory. It is allocated by the host on the device. All threads can access texture memory.
Constant memory is also a read-only cache for global device memory, allocated by the host on the device. It is not particularly large: for example, 64 KB on the Pascal P100 GPU. Constant memory is cached in the constant cache and can be accessed by all threads. It is fast as long as threads in the same warp read the same address; otherwise the accesses are serialized. Constant memory is managed by the compiler.
Shared memory is shared by all threads in a threadblock. The maximum size is 64 KB per SM, but at most 48 KB can be assigned to a single block of threads (on a Pascal-class GPU). Again: shared memory can be accessed by all threads in the same block. Shared memory is explicitly managed by the programmer and resides on-chip on the device.
Registers are private to each thread (each thread uses its own registers), and each SM has its own register file. Register allocation is generally managed by the compiler.
Note: GPUs typically do not have an L3 cache, and current L1 and L2 caches are rather small. For example, on the Pascal P100 the L1 cache is 24 KB per SM and the L2 cache is 4 MB per GPU.
All global memory accesses on newer cards are cached to L2 and L1 is reserved for local memory accesses.
Next we will look at how to efficiently access and use these different memory types.
Minimize data transfer: the performance of a GPU kernel is typically (though not always) limited by data movement. In our finite element codes we often need to read the field variable(s) from an array q, read the geometric factors, and sometimes read additional operator matrices. Even assuming the data is already resident in global GPU memory, it takes a couple of hundred clock cycles to retrieve it. While we cannot avoid loading the data during kernel execution, we always try to optimize the loading process, and we strive to read each array/variable only once. The same rule applies to writing to global memory.
This is an example of inefficient code (it writes to the global memory array Mq many more times than necessary).
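A sketch of what such a kernel might look like; the kernel name, array names, and layout are hypothetical, not taken from our actual codes:

```cuda
// Hypothetical mass-matrix kernel: one element per block, one node per thread.
// BAD: every iteration of the loop performs a read-modify-write on global Mq.
__global__ void massMatrixBad(const int Np, const double *M,
                              const double *q, double *Mq){
  const int e  = blockIdx.x;             // element index
  const int i  = threadIdx.x;            // node index within the element
  const int id = e*Np + i;

  Mq[id] = 0.0;
  for(int j = 0; j < Np; ++j){
    Mq[id] += M[i*Np + j]*q[e*Np + j];   // global memory write every iteration
  }
}
```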
The code can be optimized by first accumulating the result into a temporary register variable tmp before writing it out once to global device memory:
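A corrected sketch, again with hypothetical names: the partial sums accumulate in the register variable tmp, and global memory is written exactly once per thread.

```cuda
__global__ void massMatrixGood(const int Np, const double *M,
                               const double *q, double *Mq){
  const int e = blockIdx.x;
  const int i = threadIdx.x;

  double tmp = 0.0;                      // register accumulator
  for(int j = 0; j < Np; ++j){
    tmp += M[i*Np + j]*q[e*Np + j];
  }
  Mq[e*Np + i] = tmp;                    // single global write per thread
}
```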
Optimize memory: constant memory (const != constant memory)
In many cases – certainly for the finite element codes – some of the variables and arrays are not altered during the course of kernel execution. For example, in finite element codes, geometric factors are only read during kernels computing the action of a stiffness or mass matrix.
It is good programming practice to declare pointers to arrays of unchanging data with the const qualifier. This informs the compiler that the values in the array will not change during kernel execution, so accesses can be optimized, e.g., served through the read-only data cache. Note: it is up to the compiler what actually happens. However, using const gives the compiler more flexibility in how to organize memory accesses.
For example, read-only array arguments in a kernel signature can be declared const.
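A sketch of the signature change; the kernel and argument names are hypothetical:

```cuda
// Before: nothing tells the compiler that geo and q are read-only.
__global__ void stiffnessActionPlain(const int Np, double *geo,
                                     double *q, double *Aq);

// After: const (here combined with __restrict__) promises that the data is
// not modified and the pointers do not alias, so the compiler may route the
// loads through the read-only data path.
__global__ void stiffnessAction(const int Np,
                                const double * __restrict__ geo,
                                const double * __restrict__ q,
                                double * __restrict__ Aq);
```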
Const can be used in the kernel code as well.
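For instance, a value that is loaded once and never modified can be held in a const local variable (a hypothetical fragment of a kernel body):

```cuda
// Inside a kernel: id and qn never change after initialization.
const int id = blockIdx.x*blockDim.x + threadIdx.x;
const double qn = q[id];   // loaded from global memory once, never overwritten
```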
Manual caching with shared memory and register variables: Shared memory has much lower latency and roughly 10x higher bandwidth than global device memory. This type of memory is shared among all threads in the same block.
There are three main uses for the shared memory:
An array is accessed by all threads in the same block. For example, operator matrices in finite element codes are such arrays. This array might be placed in shared memory.
An array must be visible to all threads in the block. A field variable in finite element codes is a good example of such an array. This array might be placed in shared memory.
An array is accessed multiple times by the same thread while other resources (registers) are needed elsewhere. This array might be placed in shared memory.
In the first case, we must be careful. If the array is very small and the same for all thread blocks, it might be more beneficial (for the performance) to rely on L1 cache instead.
We can say that in the second case shared memory is used for communication between threads in a block. In fact, with the exception of the CUDA register shuffle operations and the device atomic memory operations, this is the only "safe" way for threads in a block to communicate.
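As an illustration, here is a sketch in which the field values for one element are staged in shared memory and then reused by every thread in the block; the names and the element size are hypothetical:

```cuda
#define p_Np 64   // nodes per element (hypothetical compile-time constant)

// Launch with blockDim.x == p_Np and one block per element.
__global__ void massMatrixShared(const double *M, const double *q, double *Mq){
  __shared__ double s_q[p_Np];      // visible to all threads in the block

  const int e = blockIdx.x;
  const int i = threadIdx.x;

  s_q[i] = q[e*p_Np + i];           // each thread loads one entry
  __syncthreads();                  // wait until the whole block has loaded

  double tmp = 0.0;
  for(int j = 0; j < p_Np; ++j){
    tmp += M[i*p_Np + j]*s_q[j];    // each entry of s_q is reused Np times
  }
  Mq[e*p_Np + i] = tmp;
}
```

Note that each entry of q is read from global memory once but consumed p_Np times, which is exactly the reuse pattern that makes manual caching pay off.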
Register memory is the fastest GPU memory. Each SM has its own register file (registers are not cached). While every thread can use up to 255 registers (on Pascal), the register file is 256 KB per SM, which over the 56 SMs of the P100 gives 14,336 KB in total.
Note: the more registers per thread you use, the lower the occupancy. Thus, using too many registers might hurt performance.
Optimize memory by careful use of local memory spills: Local memory is not a separate memory, but an abstraction of global memory. It is used when the compiler decides there are not enough resources on the SM. Spilled registers land in the L1 cache, and if the L1 cache is over capacity, in global memory.
Local memory does not have to strongly impact performance, as long as your spills do not overflow L1. You can check the amount of spilling by compiling with nvcc -Xptxas -v, which reports spill loads and stores for each kernel.
Optimize execution: instruction-level parallelism
Consider two consecutive lines of code in which the second line uses the result of the first. The second line cannot be executed until the first line is done, so the pair serializes. Try to eliminate dependency chains of this kind, for example by collapsing the two lines into a single expression.
Reordering also helps. If the first two instructions in a sequence depend on each other and the third is independent, swapping the second and third instructions lets the two independent operations execute at the same time. Avoiding dependencies in this way is particularly important for instructions that require reading from or writing to global memory.
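The idea can be sketched as follows; the variables a, b, d and the arrays q, p are hypothetical:

```cuda
// Dependent chain: (2) waits on the load in (1), so nothing overlaps.
double a = q[id];        // (1) global load
double b = a + 1.0;      // (2) depends on (1)
double d = p[id];        // (3) independent global load

// Swapping (2) and (3): both loads are issued back-to-back and can be in
// flight simultaneously, so a and d are produced at the same time.
double a2 = q[id];
double d2 = p[id];
double b2 = a2 + 1.0;
```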
Optimize execution: loop unrolling
Assume that there is a for-loop inside your kernel and all iterations of this loop are independent. To increase parallelism, try unrolling the loop using #pragma unroll (in CUDA) or occaUnroll(number of loop iterations) in occa. Unrolling gives the compiler more freedom to schedule the instructions.
Word of warning: unrolling may increase register pressure and might decrease occupancy and - in some cases - hurt performance.
For example, annotating a fixed-trip-count loop with #pragma unroll lets the compiler replicate the loop body and interleave independent iterations.
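A minimal sketch of the transformation inside a kernel body; M, s_q, and i are hypothetical:

```cuda
// Before: rolled loop, conservative scheduling around the loop control.
double tmp = 0.0;
for(int j = 0; j < 8; ++j)
  tmp += M[i*8 + j]*s_q[j];

// After: the directive replicates the body eight times, removing the loop
// overhead and exposing the independent multiply-adds to the scheduler.
double tmp2 = 0.0;
#pragma unroll
for(int j = 0; j < 8; ++j)
  tmp2 += M[i*8 + j]*s_q[j];
```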
Optimize execution: warp divergence
Since threads are executed in warps of 32 threads each, we want all threads in the same warp to execute exactly the same set of instructions. Hence, if-statements that cause warp divergence are bad for performance.
For example, branching on a condition such as threadIdx.x < 11 is bad: the first 11 threads in each warp execute a different set of instructions than the remaining threads in that warp, so the warp runs both paths one after the other.
Branching on a warp-aligned condition is much better: there is still an if-statement, but in this case an entire warp executes different code than other warps, and no warp has to execute both paths.
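The contrast can be sketched like this; q and the branch bodies are hypothetical:

```cuda
const int id = blockIdx.x*blockDim.x + threadIdx.x;

// Bad: lanes 0-10 of EVERY warp take a different path than lanes 11-31,
// so each warp serially executes both branches.
if(threadIdx.x % 32 < 11){
  q[id] *= 2.0;
}else{
  q[id] += 1.0;
}

// Better: threads 0-31 of the block form warp 0, so the condition is
// uniform within each warp and no warp diverges.
if(threadIdx.x < 32){
  q[id] *= 2.0;
}else{
  q[id] += 1.0;
}
```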
While the application of the strategies mentioned above is a good starting point, optimization is an art. We often need to structure and restructure the code in multiple different ways to obtain the performance we expect. To keep the work organized, we often write multiple versions of the same kernel.