Taking Notes



Our Recent Posts




Concurrent Cloud Computing: running OCCA

Alas, our NVIDIA Titan V order didn't go through. Instead I am gearing up to run on an NVIDIA V100 GPU equipped server at paperspace.com [This is not an endorsement: proceed at your own risk, your mileage may vary, and there are other cloud providers with GPU compute nodes.] The first step after setting up a new machine instance and logging on is to install the necessary compilers and If everything was set up correctly you should get a report about the available threading APIs and the associated compute devices. I haven't yet gained access to a V100 node, but on their NVIDIA Quadro M400 node I get the following: The next step is to build and run one of the examples bundled w

Limiting Performance: an interesting read

There is an interesting personal essay here on the history and development of the art and design of numerical schemes that limit spurious oscillations in solutions of nonlinear PDEs, even in the presence of shocks. I will echo one point: the Sweby diagram (pictured above, with a citation containing the explanation below) universally captures the nature of classical TVD limiters. As a reviewer and associate journal editor I puzzle over the large number of new articles submitted on the topic. Sweby, Peter K. High resolution schemes using flux limiters for hyperbolic conservation laws. SIAM Journal on Numerical Analysis, Volume 21, Issue 5, 1984, dx.doi link.
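For readers who have not met the Sweby diagram, here is a minimal sketch (mine, not taken from the essay) of three classical limiters written as functions phi(r) of the ratio r of consecutive solution differences. It checks them against the TVD region 0 <= phi(r) <= min(2r, 2) for r > 0 and phi(r) = 0 otherwise. The function names and the sampled range of r are my own choices for illustration.

```python
def minmod(r):
    # Lower boundary of the second-order TVD region.
    return max(0.0, min(1.0, r))


def superbee(r):
    # Upper boundary of the second-order TVD region.
    return max(0.0, min(2.0 * r, 1.0), min(r, 2.0))


def van_leer(r):
    # Smooth limiter lying between minmod and superbee.
    return (r + abs(r)) / (1.0 + abs(r))


def in_tvd_region(phi, r, tol=1e-12):
    # Sweby's condition: phi = 0 for r <= 0, and 0 <= phi <= min(2r, 2) for r > 0.
    if r <= 0.0:
        return abs(phi) <= tol
    return -tol <= phi <= min(2.0 * r, 2.0) + tol


# Sample r from -2 to 5 in steps of 0.1 (an arbitrary illustrative range).
ratios = [i / 10.0 for i in range(-20, 51)]
limiters = {"minmod": minmod, "superbee": superbee, "van Leer": van_leer}

results = {
    name: all(in_tvd_region(phi(r), r) for r in ratios)
    for name, phi in limiters.items()
}

for name, inside in results.items():
    print(name, "stays inside the TVD region:", inside)

# An unlimited second-order scheme (phi = 1 everywhere) leaves the region for small r.
print("phi = 1 at r = 0.25 inside region:", in_tvd_region(1.0, 0.25))
```

All three limiters pass through phi(1) = 1, which is the condition for second-order accuracy in smooth regions.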

Portable Performance Profiling: occaBench

The mixbench micro-benchmarking utility (available on github here and documented in the references below) is a tool for measuring the data throughput and computational throughput of a mixed streaming & compute workload. We have found it to be extremely useful for creating an empirically calibrated roofline model. We ported the OpenCL version of mixbench to OCCA and ran the code in CUDA mode on a 12 GB PCI-E NVIDIA P100 GPU with: ./occaBenchMain 10240000 "mode=CUDA, deviceID=0" |& grep RES Plotting the output in MATLAB gives excellent rooflines for FP32, FP64, and INT32 mixed arithmetic and streaming: In a previous post on the Pascal Performance Pyramid we discussed how it is nigh on imposs
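As a reminder of what a roofline model predicts, here is a minimal sketch in Python. It is not part of mixbench or occaBench. The attainable throughput is the smaller of the compute peak and the arithmetic intensity times the memory bandwidth. The peak and bandwidth values below are illustrative placeholders, not measurements. In practice both would be read off the benchmark output.

```python
def roofline_gflops(intensity_flops_per_byte, peak_gflops, bandwidth_gb_per_s):
    # Attainable GFLOP/s: memory bound on the left of the ridge, compute bound on the right.
    return min(peak_gflops, intensity_flops_per_byte * bandwidth_gb_per_s)


# Placeholder calibration values (assumptions for illustration only).
PEAK_GFLOPS = 4000.0
BANDWIDTH_GB_PER_S = 400.0

# The ridge point is the intensity at which a kernel stops being bandwidth bound.
ridge_intensity = PEAK_GFLOPS / BANDWIDTH_GB_PER_S

for intensity in (0.25, 1.0, 4.0, 16.0, 64.0):
    attainable = roofline_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GB_PER_S)
    regime = "memory bound" if intensity < ridge_intensity else "compute bound"
    print(f"{intensity:6.2f} flops/byte -> {attainable:7.1f} GFLOP/s ({regime})")
```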

Spurious Solution Suppression: the Goldilocks upwind discontinuous Galerkin Time-domain method

There are roughly three schools of thought about how much stabilization should be added to control the continuity of solutions obtained through discontinuous Galerkin discretizations of time-dependent linear wave problems (e.g. acoustics, electromagnetics, linear elasticity). We can view each of these methods as one instance of a singly parameterized method. Without dwelling on too many details the variational equation for a symmetric linear hyperbolic wave system on each element is Here q is a state vector approximated by polynomials of degree N on element D^e. The parameter tau is a tunable non-negative quantity. In this blog entry we consider three values for tau that appear in the lite

High-order Discontinuous Galerkin Simulations: is single precision enough?

It is tempting to use 32 bit floating point arithmetic (FP32) on GPUs. Modern consumer grade cards from NVIDIA have theoretical peak performance of 10 TFLOPS. However, we do have to be careful about when it is safe to use single precision in high-order calculations. The above simulation was run once with FP32 (single precision shown on the left) and run once with FP64 (double precision shown on the right). You might notice a granular nature of the vorticity field in the left single precision image. This is not an artifact of jpeg compression. The discontinuous Galerkin discretization incorporates a term that adds jump terms scaled by a "lift parameter" to approximate derivatives in each ele
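One plausible way to see why jump terms are sensitive to precision is the toy calculation below. It is a sketch under my own assumptions and is not taken from the simulation code. It rounds values to FP32 with the standard library, shows that a small jump between two nearby traces vanishes entirely in FP32, and shows how FP32 representation error grows when multiplied by a large scale factor. The trace values and the `lift_scale` factor are hypothetical.

```python
import struct


def to_fp32(x):
    # Round a Python float (FP64) to the nearest FP32 value.
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical traces of the solution on either side of an element face.
u_minus = 1.0
u_plus = 1.0 + 1.0e-9

jump_fp64 = u_plus - u_minus
jump_fp32 = to_fp32(to_fp32(u_plus) - to_fp32(u_minus))

# Representation error of a single value stored in FP32.
fp32_noise = abs(to_fp32(0.1) - 0.1)

# Hypothetical scale factor multiplying the jump term.
lift_scale = 1.0e3

print("jump in FP64:", jump_fp64)
print("jump in FP32:", jump_fp32)
print("FP32 rounding noise:", fp32_noise)
print("noise after scaling:", lift_scale * fp32_noise)
```

The spacing between adjacent FP32 numbers near 1 is about 1.2e-7, so any jump smaller than that is either lost or replaced by rounding noise of that size before the scaling is applied.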

Dude Where's My FLOPS ?

Computing the Mandelbrot fractal is seemingly a perfect application for the GPU. It is simple to state: iterate z = z^2 + c, starting with z=0 for a set of values c drawn from the complex plane (wiki). The number of iterations it takes for |z| to exceed 4 is recorded and used to color a pixel in the Mandelbrot image. The image on the left is an example Mandelbrot image where each pixel corresponds to a different choice of c. Since the iteration for each value of c is independent of all the other iterations required for the other values of c this is an example of an embarrassingly parallel task. Furthermore the only time that data needs to be written to memory is when the iteration count is r
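The iteration described above can be sketched in a few lines of serial Python (a GPU version would assign one thread per value of c). This is a minimal illustration, not the code used for the image. The iteration cap, the function name, and the tiny grid are my own assumptions, and the escape test follows the text in comparing |z| with 4.

```python
def escape_count(c, max_iterations=100, escape_radius=4.0):
    # Iterate z = z^2 + c from z = 0 and count steps until |z| exceeds the radius.
    z = 0j
    for n in range(max_iterations):
        z = z * z + c
        if abs(z) > escape_radius:
            return n + 1
    # The orbit stayed bounded for the whole budget: treat c as inside the set.
    return max_iterations


# Each pixel is computed independently of all the others,
# which is what makes the task embarrassingly parallel.
width, height = 8, 4
counts = [
    [
        escape_count(complex(-2.0 + 3.0 * i / (width - 1), -1.0 + 2.0 * j / (height - 1)))
        for i in range(width)
    ]
    for j in range(height)
]

for row in counts:
    print(row)
```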

Pascal Processor Powerhouse: the beer fridge sized cluster

This is the new Pascal GPU cluster hosted in the Math Department at VT. It consists of four compute nodes, each equipped with six NVIDIA GTX 1080 Ti GPUs. Each GPU has a nominal peak FP32 performance of approximately 10 TFLOP/s (link), so the 24 GPUs in the cluster can potentially deliver approximately 240 TFLOP/s. In practice, the achieved floating point performance depends strongly on the nature of the specific application. This equipment was provided by the John K. Costain faculty chair and the College of Science at VT.

Climbing the Pascal Performance Pyramid: 10 rules of thumb

Rules of thumb for optimizing FP64 CUDA kernels on the Pascal class NVIDIA P100 GPU [numbers vary depending on the specific model; the following are measured on the 12GB PCI-Express version]. Rule #1 - device global memory bandwidth is limited. The device memory bandwidth is listed at 549 GB/s, but cudaMemcpy typically achieves only ~400 GB/s for device-to-device copies. We are very unlikely to be able to exceed the performance of a device-to-device copy with a kernel. The gray region in the above roofline plot shows how this bandwidth gap limits achievable performance. Achieved global device memory throughput can be measured using NVIDIA's nvprof command line profiler (link): nvprof -
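The bandwidth gap in Rule #1 is easy to quantify. The sketch below is plain Python arithmetic, not a profiler call. It computes the fraction of the listed bandwidth that a device-to-device copy achieves and the resulting lower bound on the runtime of a kernel that moves a given number of bytes. It assumes that a copy counts both the bytes read and the bytes written as memory traffic. The function names and the 1 GB example are hypothetical.

```python
LISTED_BANDWIDTH_GB_PER_S = 549.0    # listed device memory bandwidth (from the text)
ACHIEVED_BANDWIDTH_GB_PER_S = 400.0  # typical device-to-device copy rate (from the text)


def copy_bandwidth_gb_per_s(payload_bytes, seconds):
    # Assumption: a copy reads and writes every byte, so traffic is twice the payload.
    return 2.0 * payload_bytes / seconds / 1.0e9


def min_kernel_seconds(bytes_moved, bandwidth_gb_per_s=ACHIEVED_BANDWIDTH_GB_PER_S):
    # Lower bound on runtime for a kernel limited only by memory traffic.
    return bytes_moved / (bandwidth_gb_per_s * 1.0e9)


efficiency = ACHIEVED_BANDWIDTH_GB_PER_S / LISTED_BANDWIDTH_GB_PER_S
print(f"achieved / listed bandwidth: {efficiency:.1%}")

# Hypothetical example: a kernel that reads and writes 1 GB in total.
print(f"best-case time for 1 GB of traffic: {min_kernel_seconds(1.0e9) * 1e3:.2f} ms")
```

Comparing a kernel's measured time with this bound gives a quick estimate of how close it is to the copy-limited ceiling.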


225 Stanger St
Blacksburg, VA 24061


©2018 by the Parallel Numerical Algorithms  Research Group @VT.