When I started writing GPU code, I often heard that using shared memory was the only way to get good performance out of my code. As I dove deeper into CUDA programming and performance analysis, I came to understand that obtaining good performance on a GPU is a far more involved task. In this post I will discuss basic optimization strategies for GPUs. Understanding memory and its hierarchy: In terms of speed, GPU memory can be organized as follows: global memory + local memory (slowest), then the L2 cache, then shared memory and the L1 cache, and finally registers (fastest).
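As a toy illustration of where shared memory sits in that hierarchy, here is a minimal CUDA sketch (the kernel name, BLOCK size, and the 3-point stencil are illustrative choices, not from the original post) in which each block stages a tile of global memory into shared memory so every value is reused from fast on-chip storage:

    // Hypothetical example: a 3-point stencil staged through shared memory.
    // BLOCK, the kernel name, and the stencil itself are illustrative only.
    #define BLOCK 256

    __global__ void stencilShared(int N, const float *x, float *y){
      __shared__ float s_x[BLOCK + 2];            // tile plus one halo cell per side

      const int i = blockIdx.x*BLOCK + threadIdx.x;
      const int t = threadIdx.x;

      // stage this block's slice of global memory into on-chip shared memory
      s_x[t + 1] = (i < N) ? x[i] : 0.f;
      if(t == 0)         s_x[0]         = (i > 0)     ? x[i - 1] : 0.f;
      if(t == BLOCK - 1) s_x[BLOCK + 1] = (i + 1 < N) ? x[i + 1] : 0.f;

      __syncthreads();                            // wait until the tile is complete

      // each entry is now read three times from shared memory,
      // instead of three times from slow global memory
      if(i < N) y[i] = s_x[t] + s_x[t + 1] + s_x[t + 2];
    }

    // launch: stencilShared<<<(N + BLOCK - 1)/BLOCK, BLOCK>>>(N, d_x, d_y);

Each global word is loaded once but read three times on-chip; of course, as the post argues, this kind of staging is only one ingredient among many.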
A new Titan V arrived at Virginia Tech today. Installation went relatively smoothly thanks to the patience of Bill Reilly. The Titan V differs from the NVIDIA Tesla V100 in a couple of significant ways: the Tesla V100 has a peak memory bandwidth of 900GB/s and an L2 cache of 6MB, compared to 652GB/s and 4.5MB respectively for the Titan V. For full specs see Wikipedia. We subjected the Titan V to the occaBench testing protocol, derived from the mixbench hybrid streaming/compute benchmark.
In this post we discuss rules of thumb for performance limiters when using shared memory in an NVIDIA V100 CUDA compute kernel. The V100 16GB PCI-E card has: theoretical device memory bandwidth of 900GB/s (using cudaMemcpy we measure an achievable bandwidth of 790GB/s); combined shared memory & L1 cache, whose throughput we guesstimate as (SH + L1) = 80 (SMs) x 32 (SIMD width) x 4 (bytes per word) x 1.245 (base clock in GHz) ~= 12.75 TB/s; and a theoretical peak flop rate of 7 TFLOPS (FP64).
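The guesstimate above multiplies the SM count by the per-clock shared load width of the device. A micro-kernel along the following lines (the names, sizes, and iteration counts are our own illustrative choices, not from the post) is one way to sanity-check that number by timing repeated 4-byte shared memory loads:

    // Hypothetical shared memory throughput micro-kernel (illustrative only).
    // Effective bandwidth ~ gridDim.x * blockDim.x * 4 bytes * NITER / elapsed time.
    #define SH_WORDS 1024   // power of two, so we can index with a cheap mask
    #define NITER    4096

    __global__ void shmemReadBurn(float *out){
      __shared__ float s[SH_WORDS];

      const int t = threadIdx.x;

      // fill the tile once so subsequent reads hit shared memory, not global memory
      for(int n = t; n < SH_WORDS; n += blockDim.x) s[n] = (float) n;
      __syncthreads();

      float r = 0.f;

      // consecutive threads touch consecutive words, so loads are bank-conflict free
      for(int n = 0; n < NITER; ++n)
        r += s[(t + n) & (SH_WORDS - 1)];

      // store the accumulator so the compiler cannot dead-code the loop away
      out[blockIdx.x*blockDim.x + t] = r;
    }

Measured throughput from a kernel like this will fall short of the back-of-envelope peak once loop overhead and instruction issue limits enter the picture, which is exactly why we treat 12.75 TB/s as a guesstimate.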
Overview: This week we have been experimenting with instances on Amazon AWS and Paperspace that come equipped with NVIDIA V100 GPUs. These GPUs are hot properties and not widely available, so we had to request special access to V100-equipped instances on both systems. Both AWS and Paperspace responded quickly to our requests. The Paperspace support team was also incredibly responsive, patient, and helpful in working through some minor technical issues.
Note: this article is no longer up to date.
We created an Amazon EC2 instance with an NVIDIA V100 GPU. We will discuss that process in more detail in a future posting. As usual, this is not an endorsement of a particular cloud server provider or of a particular GPU model or manufacturer.
Running occaBench with default settings gave erroneously high throughput estimates, likely due to a different cache configuration than on earlier GPUs. We creatively tweaked the workload for the benchmarking kernel (see the mixbench reference below).
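In the spirit of that tweak, a mixbench-style kernel sweeps arithmetic intensity by pairing each streamed word with a tunable number of flops. The sketch below (the template parameter, names, and constants are our own, not occaBench's) shows the general shape; the array footprint should be chosen to exceed the L2 cache so reads genuinely come from device memory:

    // Hypothetical mixbench-style kernel (illustrative; not the occaBench source).
    // Sweeping COMPUTE_ITERS sweeps the flops-per-byte ratio of the kernel:
    // arithmetic intensity ~ 2*COMPUTE_ITERS flops / 8 bytes moved per thread.
    template <int COMPUTE_ITERS>
    __global__ void mixedKernel(int N, float alpha, const float *x, float *y){
      const int i = blockIdx.x*blockDim.x + threadIdx.x;
      if(i >= N) return;

      float r = x[i];                 // one 4-byte load per thread

    #pragma unroll
      for(int n = 0; n < COMPUTE_ITERS; ++n)
        r = r*alpha + r;              // one fused multiply-add, i.e. 2 flops

      y[i] = r;                       // one 4-byte store per thread
    }

    // choose N so 2*N*sizeof(float) comfortably exceeds the L2 cache, e.g.
    // mixedKernel<32><<<(N + 255)/256, 256>>>(N, alpha, d_x, d_y);

Instantiating the kernel across a range of COMPUTE_ITERS values traces out the transition from bandwidth-bound to compute-bound behavior, which is the roofline curve the benchmark reports.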
VT CEED BP Software Release: the VT Parallel Numerical Algorithms team has released GPU-optimized implementations for the Center for Efficient Exascale Discretizations (CEED) bake-off competition on GitHub here. The details are described in this report on arXiv. CEED Background: The CEED project is funded by the Department of Energy to develop new numerical simulation tools that can exploit the next generation of leadership compute facilities. Initial goals include adding GPU acceleration to existing simulation codes.