Vaunted Volta Verified: initial comparison of the NVIDIA V100 & P100 GPUs
We created an Amazon EC2 instance with an NVIDIA V100 GPU. We will discuss that process in more detail in a future post. As usual, this is not an endorsement of a particular cloud server provider or of a particular GPU model or manufacturer. Running occaBench with default settings gave erroneously high throughput estimates, likely because the V100 has a different cache configuration than earlier GPUs. We creatively tweaked the workload for the benchmarking kernel (see the mixbench references below for details of the original benchmark) and were able to obtain performance that did not exceed the manufacturer spec.
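The sanity check described above, comparing a measured rate against the manufacturer peak, can be sketched as follows. This is a minimal illustration in Python, not the occaBench code: the function names, the operation count per element, and the timing value are hypothetical, and only the 7.8 TFLOPS FP64 peak and the vector length are taken from this post.

```python
def measured_gflops(n_elements, flops_per_element, seconds):
    """Throughput implied by a timed kernel run, in GFLOP/s."""
    return n_elements * flops_per_element / seconds / 1e9


def exceeds_peak(gflops, peak_tflops):
    """True if a measured rate is above the manufacturer peak.

    A rate above peak signals a flawed measurement (for example, work
    being skipped or served from cache) rather than real performance.
    """
    return gflops > peak_tflops * 1e3


# Hypothetical run: 10,240,000 elements, 2,048 FP64 operations each, 3 ms.
rate = measured_gflops(10_240_000, 2048, 3.0e-3)
print(f"{rate:.0f} GFLOP/s, exceeds V100 FP64 peak: {exceeds_peak(rate, 7.8)}")
```

Any estimate that fails this check should be treated as a measurement artifact and the workload adjusted before results are reported.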
The NVIDIA V100 16GB on the AWS instance has a manufacturer peak spec of 7.8 TFLOPS (FP64) and 15.7 TFLOPS (FP32). The occaBench code running on this GPU in OCCA:OpenCL mode on a vector of length 10,240,000 achieves the following performance:
The performance increases slightly for larger problem sizes.
For reference, the NVIDIA P100 12GB PCI-E has a manufacturer peak spec of 4.7 TFLOPS (FP64) and 9.3 TFLOPS (FP32). The same kernel running on this P100 GPU in OCCA:OpenCL mode achieves:
Overall, it appears that for this specific benchmark the V100 delivers approximately 50% better floating point arithmetic throughput (both FP32 and FP64), as well as apparently much better integer arithmetic throughput (INT32). The empirical roofline slope of the V100 is double that of the P100. This can be attributed largely to the peak memory bandwidth specs of the cards: 900 GB/s (V100) and 549 GB/s (P100 12GB). These results should be viewed as preliminary. We will update them as we gain more experience with the V100.
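To relate these specs to a roofline, the standard model bounds attainable throughput by the smaller of peak compute and operational intensity times peak bandwidth. The sketch below applies that textbook formula to the manufacturer numbers quoted in this post. The function and variable names are illustrative choices of ours, and the output consists of spec-sheet bounds, not occaBench measurements.

```python
def roofline_gflops(intensity, peak_gflops, peak_gbs):
    """Attainable GFLOP/s at a given operational intensity (FLOP/byte)."""
    return min(peak_gflops, intensity * peak_gbs)


# Manufacturer peak specs quoted in this post (FP64 GFLOP/s, GB/s).
specs = {
    "V100 16GB": {"fp64": 7800.0, "bw": 900.0},
    "P100 12GB": {"fp64": 4700.0, "bw": 549.0},
}

for name, s in specs.items():
    # Ridge point: intensity above which the bound is set by compute.
    ridge = s["fp64"] / s["bw"]
    print(f"{name}: ridge point {ridge:.2f} FLOP/byte")
    for intensity in (0.25, 1.0, 4.0, 16.0):
        bound = roofline_gflops(intensity, s["fp64"], s["bw"])
        print(f"  {intensity:5.2f} FLOP/byte -> {bound:7.1f} GFLOP/s")

# Ratio of the bandwidth-limited slopes implied by the spec sheets alone.
slope_ratio = specs["V100 16GB"]["bw"] / specs["P100 12GB"]["bw"]
print(f"spec bandwidth ratio V100/P100: {slope_ratio:.2f}")
```

On the spec sheets alone the bandwidth ratio is about 1.64. Empirical slopes also depend on how much of the peak bandwidth each card actually sustains in a given kernel.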
Release: we will release the occaBench code on GitHub in the near future.
Background: for the original mixbench micro-benchmarking that occaBench is based on, see the following papers by Konstantinidis & Cotronis:
Konstantinidis, E., Cotronis, Y., "A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling", Journal of Parallel and Distributed Computing, Volume 107, 2017, pp. 37-56.
Konstantinidis, E., Cotronis, Y., "A Practical Performance Model for Compute and Memory Bound GPU Kernels", 2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 651-658.