Portable Performance Profiling: occaBench
The mixbench micro-benchmarking utility (available on GitHub here and documented in the references below) measures the data throughput and computational throughput of a mixed streaming and compute workload. We have found it extremely useful for building an empirically calibrated roofline model.
We ported the OpenCL version of mixbench to OCCA and ran the code in CUDA mode on a 12 GB PCI-E NVIDIA P100 GPU with:
./occaBenchMain 10240000 "mode=CUDA, deviceID=0" |& grep RES
Plotting the output in MATLAB gives excellent rooflines for FP32, FP64, and INT32 mixed arithmetic and streaming:
In a previous post on the Pascal Performance Pyramid we discussed how it is nigh on impossible to hit this GPU's theoretical peak of ~10 TFLOPS (FP32) and ~5 TFLOPS (FP64) because of the data streaming rates that can actually be achieved. This mixed workload benchmark confirms that again.
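For readers less familiar with roofline models, the bound behind these plots can be sketched in a few lines of Python. This is only an illustration of the classic roofline formula, not code from occaBench: the function names are made up for this example, the FP32 peak is the ~10 TFLOPS figure quoted above, and the bandwidth is a placeholder assumption rather than a measured value.

```python
def roofline_gflops(intensity, peak_gflops, bandwidth_gb_per_s):
    """Classic roofline bound: attainable GFLOP/s at a given arithmetic
    intensity (flops per byte streamed) is the smaller of the compute
    peak and bandwidth * intensity."""
    return min(peak_gflops, bandwidth_gb_per_s * intensity)


def ridge_point(peak_gflops, bandwidth_gb_per_s):
    """Arithmetic intensity (flops/byte) at which the bound changes
    from bandwidth limited to compute limited."""
    return peak_gflops / bandwidth_gb_per_s


# ~10 TFLOPS FP32 peak as quoted in the text.
PEAK_FP32_GFLOPS = 10_000.0
# ASSUMPTION: placeholder streaming bandwidth, not a measured value.
ASSUMED_BANDWIDTH_GB_S = 500.0

if __name__ == "__main__":
    for intensity in (0.25, 1.0, 4.0, 16.0, 64.0):
        bound = roofline_gflops(intensity, PEAK_FP32_GFLOPS, ASSUMED_BANDWIDTH_GB_S)
        print(f"{intensity:6.2f} flop/byte -> at most {bound:8.1f} GFLOP/s")
    ridge = ridge_point(PEAK_FP32_GFLOPS, ASSUMED_BANDWIDTH_GB_S)
    print(f"ridge point: {ridge:.1f} flop/byte")
```

Under these assumed numbers a kernel needs at least 20 flops per byte streamed before the compute peak, rather than the memory system, becomes the limit. An empirically calibrated roofline replaces both constants with the rates the benchmark actually measures.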
A nice feature of the OCCA implementation of occaBench is that we can switch to running the benchmark in native OpenCL mode. When running occaBench we simply change the command line arguments:
./occaBenchMain 10240000 "mode=OpenCL, deviceID=0, platformID=0" |& grep RES
The OpenCL results are reassuringly similar to CUDA for the P100:
OCCA also supports OpenMP running on CPUs. By changing the command line argument we can run on available CPUs using:
./occaBenchMain 10240000 "mode=OpenMP" |& grep RES
Running on two Intel Xeon E5-2680 CPUs on a node of the newriver cluster at VT we obtain the following performance:
It is useful that we can run the occaBench kernels in this cross-platform manner, but it is also evident that, although the code achieves good performance in OpenCL and CUDA modes, it does not run as effectively on the CPUs in OpenMP mode. The kernel will likely need to be rewritten to fully exploit the CPU.
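Since the three runs above differ only in the mode string, they are easy to script. The following Python sketch rebuilds the three command lines shown in this post and mimics the `grep RES` filter; the helper names are hypothetical, the first positional argument is simply copied from the commands above, and nothing is assumed about the layout of the RES lines beyond their containing the text "RES".

```python
import shlex

# First positional argument, copied verbatim from the commands in this post.
SIZE_ARG = 10240000

# Mode strings exactly as used above: (mode, extra device options).
CONFIGS = [
    ("CUDA", {"deviceID": 0}),
    ("OpenCL", {"deviceID": 0, "platformID": 0}),
    ("OpenMP", {}),
]


def occa_bench_command(mode, size_arg=SIZE_ARG, **options):
    """Build the argument list for one occaBenchMain run (hypothetical helper)."""
    parts = [f"mode={mode}"] + [f"{key}={value}" for key, value in options.items()]
    return ["./occaBenchMain", str(size_arg), ", ".join(parts)]


def res_lines(output_text):
    """Keep only lines containing 'RES', like piping through `grep RES`."""
    return [line for line in output_text.splitlines() if "RES" in line]


if __name__ == "__main__":
    for mode, options in CONFIGS:
        # Print the shell-quoted command; pass the list to subprocess.run
        # with capture_output=True to actually execute it.
        print(shlex.join(occa_bench_command(mode, **options)))
```

Building the commands as argument lists keeps the quoted mode string intact as a single argument, which is what the shell quoting in the original commands achieves.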
To generate one more potentially insightful data point we ran the code on a GTX 1080 Ti in CUDA mode:
Interestingly, the peak FP32 performance of the GTX 1080 Ti is higher than that of the P100. This might be explained if the ECC memory error checking that the P100 performs is reducing the effective memory bus bandwidth. Finally, as expected, the FP64 performance of the GTX 1080 Ti is very low compared to its FP32 performance.
Release: we will release the occaBench code on GitHub in the near future.
Background: for the original mixbench micro-benchmarking work that occaBench is based on, see the following papers by Konstantinidis & Cotronis:
Konstantinidis, E., Cotronis, Y., "A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling", Journal of Parallel and Distributed Computing, Volume 107, 2017, pp. 37-56, dx.doi, publisher link.
Konstantinidis, E., Cotronis, Y., "A Practical Performance Model for Compute and Memory Bound GPU Kernels", 2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 651-658, dx.doi, publisher link.
Note: figures corrected 01/19/18.