A new Titan V arrived at Virginia Tech today. Installation went relatively smoothly thanks to the patience of Bill Reilly.
The Titan V differs from the NVIDIA Tesla V100 in a couple of significant ways. The Tesla V100 has a peak memory bandwidth of 900 GB/s and an L2 cache of 6 MB, compared with 652 GB/s and 4.5 MB respectively for the Titan V. For full specs see Wikipedia.
We subjected the Titan V to the occaBench testing protocol, derived from the mixbench hybrid streaming/compute benchmark of Konstantinidis & Cotronis. The manufacturer peak performance of the Titan V is 6.1 TFLOPS (FP64) and 12.2 TFLOPS (FP32). The following figure shows that occaBench delivers very close to the manufacturer peak performance:
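The core idea of a mixbench-style benchmark is to sweep arithmetic intensity by varying the number of multiply-add operations performed per element streamed from memory. The sketch below illustrates that idea on the host in Python; the function names and constants are illustrative only and are not the occaBench API:

```python
# Illustrative sketch of the mixbench idea (not the actual occaBench kernels):
# each "kernel" streams a vector once and performs k fused multiply-adds per
# element, so arithmetic intensity grows linearly with k.
import numpy as np

def mixbench_kernel(x, k, alpha=1.000001):
    """Apply k multiply-add operations (2 flops each) to every element of x."""
    y = x.copy()
    for _ in range(k):
        y = alpha * y + x   # one multiply-add: 2 flops per element
    return y

def arithmetic_intensity(k, bytes_per_elem=8):
    """FP64 flops per byte moved: 2k flops against one read of x and one
    write of y (2 * 8 bytes per element)."""
    return (2.0 * k) / (2 * bytes_per_elem)
```

Sweeping `k` from small to large values traces out the memory-bound slope and then the compute-bound plateau seen in the figures.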
The NVIDIA V100 16GB on the AWS instance we used previously has a manufacturer peak spec of 7.8 TFLOPS (FP64) and 15.7 TFLOPS (FP32). The occaBench code running on this GPU in OCCA:OpenCL mode on a vector of length 10,240,000 achieves the following performance:
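The shape of both curves is captured by the standard roofline model: attainable throughput is the lesser of the device's peak flop rate and the product of arithmetic intensity and memory bandwidth. A minimal sketch, using the manufacturer specs quoted above (the helper name is ours, not part of occaBench):

```python
# Roofline estimate of attainable throughput at a given arithmetic intensity:
#   attainable = min(peak flop rate, intensity * memory bandwidth)
def attainable_tflops(intensity_flop_per_byte, peak_tflops, bw_gbs):
    """intensity in flop/byte, peak in TFLOPS, bandwidth in GB/s."""
    return min(peak_tflops, intensity_flop_per_byte * bw_gbs / 1000.0)

# V100 FP64 example: memory-bound at low intensity, compute-bound at high.
low  = attainable_tflops(4.0,  7.8, 900.0)   # bandwidth-limited
high = attainable_tflops(20.0, 7.8, 900.0)   # capped at peak
```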
The most obvious difference is the slope of the memory-bound part of these curves. The additional memory bandwidth of the V100 clearly allows it to achieve peak performance at a noticeably lower arithmetic intensity.
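As a back-of-envelope check, the roofline ridge point (the arithmetic intensity at which a device transitions from memory-bound to compute-bound) is the peak flop rate divided by the peak memory bandwidth. The helper below is illustrative, plugging in the manufacturer FP64 specs quoted above:

```python
# Ridge point in flop/byte: peak flop rate / peak memory bandwidth.
def ridge_point(peak_tflops, bw_gbs):
    return peak_tflops * 1e12 / (bw_gbs * 1e9)

v100_fp64   = ridge_point(7.8, 900.0)   # lower ridge: peak reached sooner
titanv_fp64 = ridge_point(6.1, 652.0)   # higher ridge: needs more flops/byte
```

The V100's ridge point sits at a lower intensity than the Titan V's, which is exactly the difference in slope visible in the two figures.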
On the other hand, the Titan V delivers surprisingly robust FP64 performance for its $3,000 price. It also achieves a sustained memory throughput of 620 GB/s, about 95% of its 652 GB/s peak.
These results should be viewed as preliminary. We will update them as we gain more experience with the Titan V.
Release: we will release the occaBench code on GitHub in the near future.
Background: for the original mixbench micro-benchmarking that occaBench is based on see the following papers by Konstantinidis & Cotronis:
Elias Konstantinidis, Yiannis Cotronis, "A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling", Journal of Parallel and Distributed Computing, Volume 107, 2017, pp. 37-56.
Elias Konstantinidis, Yiannis Cotronis, "A Practical Performance Model for Compute and Memory Bound GPU Kernels", 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2015), pp. 651-658.