Dude Where's My FLOPS?
Computing the Mandelbrot fractal is seemingly a perfect application for the GPU. It is simple to state: iterate z = z^2 + c, starting with z = 0, for a set of values c drawn from the complex plane (wiki). The number of iterations it takes for |z| to exceed 2 (in practice one tests |z|^2 > 4 to avoid a square root) is recorded and used to color a pixel in the Mandelbrot image. The image on the left is an example Mandelbrot image where each pixel corresponds to a different choice of c.
Since the iteration for each value of c is independent of the iterations for all the other values of c, this is an example of an embarrassingly parallel task. Furthermore, the only time data needs to be written to memory is when the iteration count is recorded, so this is also an example of a compute bound problem. Assuming that we cap the maximum number of iterations at 256 and each iteration takes O(10) floating point operations, the flops/byte ratio is very high: 2560 flops per 4 bytes stored, or roughly 640 flops per byte.
An example CUDA device function implementing the Mandelbrot iteration is sketched below, where we have tried to goose the number of fused multiply-add (fmad) floating point operations. We also used a probably pointless trick of passing c/2.
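A minimal sketch along these lines is given here, assuming double precision and the 256 iteration cap from above; the names (MAX_ITER, mandelbrotIterations, cudaMandelbrotKernel) and the exact arrangement of the arithmetic are illustrative rather than the original source.

#define MAX_ITER 256  // hypothetical iteration cap, matching the 256 assumed above

// Count how many iterations of z <- z^2 + c are needed before |z| exceeds 2.
// The caller passes c/2 (half_cre, half_cim) so that the imaginary part update
// becomes a single multiply-add followed by a doubling.
__device__ int mandelbrotIterations(const double half_cre, const double half_cim)
{
  const double cre = half_cre + half_cre; // recover the full real part once
  double zre = 0.0, zim = 0.0;
  int n = 0;

  // Escape test: |z|^2 <= 4 (i.e. |z| <= 2), which avoids a square root.
  while (n < MAX_ITER && zre*zre + zim*zim <= 4.0) {
    // z <- z^2 + c expanded into real and imaginary parts:
    //   Re: zre*zre - zim*zim + cre
    //   Im: 2*zre*zim + cim = 2*(zre*zim + cim/2)
    // With fmad enabled the compiler can fuse most of these multiply-adds.
    const double newre = zre*zre - zim*zim + cre;
    const double newim = 2.0*(zre*zim + half_cim);
    zre = newre;
    zim = newim;
    ++n;
  }
  return n;
}

// One thread per pixel: map (i, j) to a point c in the complex plane and store
// the iteration count used to color that pixel (the single 4-byte write).
__global__ void cudaMandelbrotKernel(const int width, const int height,
                                     const double x0, const double y0,
                                     const double dx, const double dy,
                                     int *counts)
{
  const int i = blockIdx.x*blockDim.x + threadIdx.x;
  const int j = blockIdx.y*blockDim.y + threadIdx.y;
  if (i < width && j < height) {
    const double cre = x0 + i*dx;
    const double cim = y0 + j*dy;
    counts[j*width + i] = mandelbrotIterations(0.5*cre, 0.5*cim); // pass c/2
  }
}

For a P100 such a kernel might be compiled along the lines of nvcc -O3 -arch=sm_60 -o cudaMandelbrot cudaMandelbrot.cu, with one thread launched per pixel of the 4096x4096 image.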
Profiling the code with nvprof on a P100 PCI-E 12GB model using
nvprof --metrics flop_dp_efficiency ./cudaMandelbrot 4096 4096
yields the (maybe) surprising result: ~52% FP64 efficiency! This "ideal" compute bound and embarrassingly parallel application is missing peak FP64 performance by approximately 2 TFLOP/s. Why?
We can do a little more detective work by scrutinizing all the available metrics that nvprof offers (using nvprof --metrics all ...).
One class of statistics stands out: flop_count_dp (total # of FP64 operations), flop_count_dp_add (# of FP64 additions), flop_count_dp_mul (# of FP64 multiplications), and flop_count_dp_fma (# of fused FP64 multiply-adds). In the Mandelbrot calculation approximately half of the operations are adds, half are fmads, with a few multiplies.
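These particular counters can also be requested directly, for example with:
nvprof --metrics flop_count_dp,flop_count_dp_add,flop_count_dp_mul,flop_count_dp_fma ./cudaMandelbrot 4096 4096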
Unfortunately, to get peak performance all the operations must be fmads. Thus in this case we are a priori limited to 75% of peak performance.
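To see where the 75% figure comes from, note that a fused multiply-add counts as two flops per instruction while an add counts as one. With an even split between add and fmad operations we therefore get (1 + 2) = 3 flops per pair of instructions, versus (2 + 2) = 4 flops if both were fmads, i.e. 3/4 of peak.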
We can even see what instructions the nvcc CUDA compiler generates for the above device function by using cuobjdump:
cuobjdump --dump-sass cudaMandelbrot
This command dumps the SASS assembly generated for the kernels (and device functions) in the compiled CUDA binary, with the relevant snippet shown below.
The while loop starts at line (address) 0138. By using the --use_fast_math or --fmad=true nvcc command line options we coaxed the compiler into using three fmad operations (DFMA), but there are also three add operations (DADD). Almost all the flops are generated in these few lines of code. This explains at least 1 TFLOP/s of the flop deficit. It is not immediately obvious where the other TFLOP/s was lost. Stay tuned.