# CEED Code Competition: VT software release

VT CEED BP Software Release: the VT Parallel Numerical Algorithms team has released GPU-optimized implementations for the Center for Efficient Exascale Discretizations (CEED) bake-off competition on GitHub at https://github.com/kswirydo/CEED-Ax. The details are described in a report on arXiv (see the Preprint entry at the end of this post).

CEED Background: The CEED project is funded by the Department of Energy to develop new numerical simulation tools that can exploit the next generation of leadership compute facilities. Initial goals include adding GPU acceleration to the MFEM modular finite element package and the Nek5000 finite-element-based flow solver code.

The VT Parallel Numerical Algorithms team has used the standalone mini-app development strategy to fast-track the development and assessment of GPU-accelerated components for these solvers.

CEED Bake-off Problems Objective: The CEED project team defined a set of numbered bake-off problems (BP). The whole CEED BP suite is detailed on the CEED website. Each BP is carefully designed as a feasibility study of a major component of finite element analysis. Each CEED BP is also meant to predict the impact of both implementation issues and the architectural shifts that accompany next-generation accelerators.

We used the VT NewRiver GPU compute nodes to develop a set of OCCA compute kernels for the BP finite element operations. These operations include:

BP1.0 - multiplication by the elemental mass matrix using an intermediate quadrature rule for integration.

BP3.0 - multiplication by the elemental stiffness matrix using an intermediate quadrature rule for integration.

BP3.5 - multiplication by the elemental stiffness matrix using Gauss-Lobatto-Legendre (GLL) nodes for integration (as used in the Nek5000 code).

VT CEED BP Software Quickstart: if you have already installed OCCA, then installing, building, and running the VT CEED BP codes should be straightforward:

git clone https://github.com/kswirydo/CEED-Ax

# BP1.0

cd CEED-Ax/BP10

make

# to perform mass-matrix-vector product on a mesh with 512 hex elements and polynomial degree 10 and tensor-product quadrature of 12^3 nodes per element

./BP10 512 10

# BP3.0

cd ../BP30

make

# to perform stiffness-matrix-vector product on a mesh with 512 hex elements and polynomial degree 10 and tensor-product quadrature of 12^3 nodes per element

./BP30 512 10

# BP3.5

cd ../BP35

make

# to perform stiffness-matrix-vector product on a mesh with 512 hex elements and polynomial degree 10 and tensor-product quadrature of 11^3 nodes per element

./BP35 512 10

Hint: You may need to adjust the OCCA device/platform/thread model for your specific hardware and system configuration. This can be done by editing the driver cpp files, rebuilding with make, and rerunning.
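To make the command-line arguments above concrete, here is a small Python sketch of the problem sizes they imply. It assumes, based on the comments in the quickstart, that the two arguments are the number of hex elements and the polynomial degree, that BP1.0 and BP3.0 use one more quadrature point per direction than there are nodes, and that BP3.5 integrates on the nodes themselves. The function name `problem_size` and its parameters are illustrative and do not come from the released code; the total counts element-local nodes without merging nodes shared between elements.

```python
def problem_size(n_elements, degree, extra_quad_points=1):
    """Return per-element and total point counts for a hex mesh.

    extra_quad_points=1 matches the BP1.0/BP3.0 comments (degree 10 -> 12^3
    quadrature points); extra_quad_points=0 matches BP3.5 (degree 10 -> 11^3).
    """
    nodes_1d = degree + 1                    # nodes per direction
    quad_1d = nodes_1d + extra_quad_points   # quadrature points per direction
    return {
        "nodes_per_element": nodes_1d ** 3,
        "quad_points_per_element": quad_1d ** 3,
        "total_element_nodes": n_elements * nodes_1d ** 3,
    }


if __name__ == "__main__":
    # Sizes implied by "./BP10 512 10" and "./BP35 512 10"
    print(problem_size(512, 10))
    print(problem_size(512, 10, extra_quad_points=0))
```

For the degree 10 runs in the quickstart this gives 11^3 = 1331 nodes per element, with 12^3 = 1728 quadrature points per element for BP1.0 and BP3.0.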

VT CEED BP Software Results: each benchmark code consists of a family of kernel implementations for each finite-element operation. We intentionally retain the sequence of intermediate kernel code optimizations to illustrate their relative importance. The optimizations are detailed in the preprint linked below.
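As a point of reference for what the BP1.0 kernels compute, the following NumPy sketch applies an elemental mass matrix to one hex element using the tensor-product (sum factorization) structure: interpolate to the quadrature points one direction at a time, scale by quadrature weight times Jacobian, and project back. It is an unoptimized illustration and not the released OCCA code. Several choices are assumptions made for brevity: Chebyshev-Lobatto points stand in for the GLL nodes, a Gauss-Legendre rule stands in for the intermediate quadrature rule, the Jacobian is taken as 1 (reference element), and all function names are hypothetical.

```python
import numpy as np


def lagrange_interp_matrix(nodes, points):
    """Matrix I with I[a, i] = i-th Lagrange basis function evaluated at points[a]."""
    I = np.ones((len(points), len(nodes)))
    for j, xj in enumerate(nodes):
        for k, xk in enumerate(nodes):
            if k != j:
                I[:, j] *= (points - xk) / (xj - xk)
    return I


def mass_apply(u, I, wJ):
    """Apply the elemental mass matrix to nodal values u of shape (Np, Np, Np).

    I  : (Nq, Np) 1D interpolation matrix from nodes to quadrature points
    wJ : (Nq, Nq, Nq) quadrature weight times Jacobian at each quadrature point
    """
    # Interpolate to quadrature points, one direction at a time
    uq = np.einsum('ai,ijk->ajk', I, u)
    uq = np.einsum('bj,ajk->abk', I, uq)
    uq = np.einsum('ck,abk->abc', I, uq)
    # Pointwise scaling by weights and geometric factors
    uq = wJ * uq
    # Project back to the nodes with the transposed interpolation
    v = np.einsum('ck,abc->abk', I, uq)
    v = np.einsum('bj,abk->ajk', I, v)
    v = np.einsum('ai,ajk->ijk', I, v)
    return v


def build_example(degree):
    """Build I and wJ for the reference element [-1, 1]^3 (Jacobian assumed 1)."""
    n_nodes = degree + 1
    n_quad = degree + 2
    # Assumption: Chebyshev-Lobatto points as a stand-in for GLL nodes
    nodes = -np.cos(np.pi * np.arange(n_nodes) / degree)
    # Assumption: Gauss-Legendre rule as the intermediate quadrature
    points, weights = np.polynomial.legendre.leggauss(n_quad)
    I = lagrange_interp_matrix(nodes, points)
    wJ = np.einsum('a,b,c->abc', weights, weights, weights)
    return I, wJ


if __name__ == "__main__":
    I, wJ = build_example(3)
    ones = np.ones((4, 4, 4))
    # Summing M applied to the constant 1 recovers the element volume
    print(mass_apply(ones, I, wJ).sum())
```

The three one-directional contractions are what keep the cost low compared with forming the dense elemental matrix. The optimized kernels discussed in this post address how to map this same arithmetic onto GPU threads and memory.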

In the following figure we show, for a range of polynomial degrees, the spread of achieved performance measured in TFLOP/s (i.e. 10^12 floating point operations per second) on an NVIDIA P100 12GB PCI-E GPU for the BP1.0 "mass-matrix-vector product" test case.

The empirical roofline is determined by timing how long it takes to move the same amount of data with a device-to-device transfer using cudaMemcpy. In short, if the performance reaches this roofline, then evaluating the matrix-vector product takes the same total time as loading the element node data, loading the geometric factors, and storing the result to device memory. We can see in the above figure that up to degree 11 we have in fact been able to find a kernel that achieves very close to this ideal roofline performance.
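The roofline argument above can be sketched as a short calculation: the best achievable rate is the number of floating point operations per element divided by the time needed to move that element's data. The Python sketch below is a rough model under stated assumptions and is not taken from the preprint. It assumes double precision (8 bytes per value), one read of the node data, one read of a single geometric factor per quadrature point, one write of the result, and a flop count based on sum factorization with degree + 2 quadrature points per direction. Traffic for the small 1D interpolation matrix is ignored. The bandwidth passed in is a placeholder for whatever the cudaMemcpy timing gives on a particular GPU.

```python
def bp1_counts(degree):
    """Estimated flops and bytes moved per element for a BP1.0-style mass apply."""
    n = degree + 1   # nodes per direction
    q = degree + 2   # quadrature points per direction (assumed)
    # Read n^3 node values, read q^3 geometric factors, write n^3 results
    bytes_moved = 8 * (2 * n**3 + q**3)
    # Three one-directional interpolations, each a multiply-add per term
    interp_flops = 2 * (q * n**3 + q**2 * n**2 + q**3 * n)
    # Interpolate, scale pointwise, then project back
    flops = 2 * interp_flops + q**3
    return flops, bytes_moved


def roofline_gflops(degree, bandwidth_gb_per_s):
    """Upper bound in GFLOP/s when limited only by data movement."""
    flops, bytes_moved = bp1_counts(degree)
    return bandwidth_gb_per_s * flops / bytes_moved


if __name__ == "__main__":
    measured_bandwidth = 500.0  # GB/s, placeholder value, not a measured P100 figure
    for degree in range(1, 16):
        print(degree, round(roofline_gflops(degree, measured_bandwidth), 1))
```

Because the flop count grows faster with degree than the data volume does, this bound rises with polynomial degree, which is why a kernel can stay memory-bound at low degree and become limited by other resources at high degree.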

Narrowing to the degree 13 spectral element, we see in the following figure that the most optimized kernel (#8) delivers an 18x speedup over the fairly naive kernel #1.

The roofline model and the kernel optimization strategies are detailed in the preprint linked below.

Acknowledgement: Kasia Swirydowicz was instrumental in the development of many of the kernel optimizations for these bake-off problems and provided the figures for this blog entry.

Preprint: the VT CEED BP results are described in the preprint "Acceleration of tensor-product operations for high-order finite element methods", by Kasia Swirydowicz, Noel Chalmers, Ali Karakus, and T. Warburton, available on arXiv.

Project Sponsorship: this research was funded in part by the DOE Exascale Computing Project. The work also benefited from generous support from the John K. Costain Faculty Chair in the College of Science at Virginia Tech.