Taking Notes




Basic GPU optimization strategies

When I started writing GPU code, I often heard that using shared memory is the only way to get good performance. As I kept diving deeper into CUDA programming and performance analysis, I came to understand that obtaining good performance on a GPU is a far more involved task. In this post I will discuss basic optimization strategies for GPUs. Understanding memory and its hierarchy: in terms of speed, GPU memory can be organized as follows:

  • Global memory + local memory (slowest)
  • Texture memory
  • Constant memory
  • Shared memory
  • Registers (fastest)

Short characteristics of each memory type: global memory is typically large (from 1GB to 16GB). It allows for read and write access and
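As a sketch of how shared memory fits into this hierarchy, here is a minimal CUDA kernel that stages a tile of global memory in on-chip shared memory once and then reuses it for a block-level reduction; the kernel name and tile size are illustrative, not taken from the post:

```cuda
// Sketch: stage data from (slow) global memory into (fast) shared memory
// once, then let every thread in the block reuse it from there.
#define TILE 256

__global__ void blockSum(const float *in, float *out, int n) {
  __shared__ float tile[TILE];           // on-chip shared memory

  int gid = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;  // one global load per thread
  __syncthreads();                       // tile is now fully populated

  // Tree reduction entirely in shared memory (no further global traffic)
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one global store per block
}
```

A launch of the form `blockSum<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);` issues one global load and one global store per block; everything in between stays in shared memory.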

Titan-V @ V-Tech: initial benchmarking results

A new Titan V arrived at Virginia Tech today. Installation went relatively smoothly thanks to the patience of Bill Reilly. The Titan V differs from the NVIDIA Tesla V100 in a couple of significant ways. The Tesla V100 has a peak bandwidth of 900GB/s and an L2 cache of 6MB, compared to 652GB/s and 4.5MB respectively for the Titan V. For full specs see Wikipedia. We subjected the Titan V to the occaBench testing protocol derived from the mixbench hybrid streaming/compute benchmarking of Konstantinidis & Cotronis. The peak performance of the Titan V is 6.1 TFLOPS (FP64) and 12.2 TFLOPS (FP32). In the following figure we see that occaBench delivers very close to the manufacturer peak performance: The NV
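As a back-of-envelope check of those peak numbers, the host-side program below recomputes them from per-SM unit counts; the assumed specs (80 SMs, 32 FP64 / 64 FP32 units per SM, ~1.2 GHz base clock, FMA counted as 2 flops) are my own, not from the post:

```cuda
// Back-of-envelope peak-FLOPS check (host code; assumed Titan V specs).
#include <cstdio>

int main() {
  const double sms = 80.0;       // streaming multiprocessors (assumed)
  const double clockGHz = 1.2;   // base clock (assumed)
  const double fmaFlops = 2.0;   // fused multiply-add = 2 flops

  double fp64 = sms * 32.0 * fmaFlops * clockGHz * 1e9;  // 32 FP64 units/SM
  double fp32 = sms * 64.0 * fmaFlops * clockGHz * 1e9;  // 64 FP32 units/SM

  printf("FP64 peak ~ %.1f TFLOPS\n", fp64 / 1e12);  // ~6.1
  printf("FP32 peak ~ %.1f TFLOPS\n", fp32 / 1e12);  // ~12.3
  return 0;
}
```

The FP64 figure lands on the 6.1 TFLOPS quoted above; the FP32 figure is within rounding of 12.2 TFLOPS (the exact value depends on which clock the spec sheet uses).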

Rough-n-Ready Roofline: NVIDIA V100 edition

In this post we discuss rules of thumb for performance limiters when using shared memory in an NVIDIA V100 CUDA compute kernel. The V100 16GB PCI-E card has:

  • Theoretical device memory bandwidth of 900GB/s. Using cudaMemcpy we measure achievable memory bandwidth of 790GB/s.
  • Combined shared memory & L1 cache, which we guesstimate to have throughput: (SH + L1) bandwidth = 80 (SMs) x 32 (SIMD width) x 4 (word bytes) x 1.245 (base clock, GHz) ~= 12.75 TB/s
  • Theoretical peak of 7 TFLOPS FP64 (link)

Putting these together we plot the following FP64 performance pyramid for the V100: Interpreting this: each band tells us the maximum performance we can expect for a given range of combined shared + L1 mem
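The shared+L1 guesstimate can be restated as a tiny host-side program; this is just a sketch that reuses the constants quoted in the post:

```cuda
// Host-side restatement of the shared+L1 throughput guesstimate.
#include <cstdio>

int main() {
  const double sms = 80.0;          // V100 streaming multiprocessors
  const double simdWidth = 32.0;    // lanes per SM per clock
  const double wordBytes = 4.0;     // one 32-bit word per lane per clock
  const double baseClockGHz = 1.245;

  double tbPerSec = sms * simdWidth * wordBytes * baseClockGHz * 1e9 / 1e12;
  printf("shared+L1 throughput ~ %.2f TB/s\n", tbPerSec);  // ~12.75
  return 0;
}
```

Dividing an achieved shared-memory traffic estimate by this number gives the minimum time a shared-memory-bound kernel could take, which is how the bands of the performance pyramid are drawn.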

Concurrent Cloud Computing: installing occaBench for V100

Overview: This week we have been experimenting with instances on Amazon AWS and Paperspace that come equipped with NVIDIA V100 GPUs. These GPUs are hot properties and not widely available, so we had to request special access to V100 equipped instances on both systems. Both AWS and Paperspace responded quickly to our requests. The Paperspace support team was also incredibly responsive, patient, and helpful getting through some minor technical issues. Note: this article is not an endorsement of these companies or their products, we are just providing an insight into our experience getting started on their systems. Your mileage may vary. In our experience both systems were very similar once

Vaunted Volta Verified: initial comparison of the NVIDIA V100 & P100 GPUs

We created an Amazon EC2 instance with an NVIDIA V100 GPU. We will discuss that process in more detail in a future posting. As usual this is not an endorsement of a particular cloud server provider or of a particular GPU model or manufacturer. Running occaBench with default settings gave erroneously high throughput estimates, likely due to a different cache configuration than earlier GPUs. We creatively tweaked the workload for the benchmarking kernel (see mixbench ref below for details of the original benchmark) and were able to obtain performance that did not exceed the manufacturer spec. The NVIDIA V100 16GB on the AWS instance has a manufacturer peak spec of 7.8 TFLOPS (FP64) and 15.7 TFL

CEED Code Competition: VT software release

VT CEED BP Software Release: the VT Parallel Numerical Algorithms team has released GPU-optimized implementations for the Center for Efficient Exascale Discretization (CEED) bake-off competition on GitHub here. The details are described in this report on arXiv. CEED Background: the CEED project is funded by the Department of Energy to develop new numerical simulation tools that can exploit the next generation of leadership compute facilities. Initial goals include adding GPU acceleration to the MFEM modular finite element package and the Nek5000 finite element based flow solver code. The VT Parallel Numerical Algorithms team has used the standalone mini-app development strategy to fast track



©2018 by the Parallel Numerical Algorithms  Research Group @VT.