**Finite Element Stiffness Matrix Action: to BLAS or not to BLAS, that is the question.**

BLAS (Basic Linear Algebra Subprograms) is a specification for common, basic linear algebra routines. BLAS functions have been implemented and optimized for GPUs, and packaged in libraries (e.g., cuBLAS, CULA, and (batched) MAGMA).


A significant part of the work performed by our finite element codes consists of operations that can be expressed as BLAS routines (e.g., dot products, vector updates, matrix-vector, and matrix-matrix multiplications). We usually develop custom kernels that combine several BLAS-type operations to reduce memory traffic or memory usage. For instance, we often implement "matrix-free" operations, where we apply the action of a matrix on a vector without storing the matrix explicitly.


Context: In a recent post we optimized a kernel that performs the stiffness matrix action on tetrahedral elements. Alternatively, this operation can be expressed as a matrix-matrix product plus a streaming kernel that applies the chain rule relating derivatives computed with respect to local elemental coordinates to derivatives in physical coordinates. In this post we analyze how our approach compares to cuBLAS.


Implementation: The matrix action is executed as two operations:

1. Matrix-matrix product using DGEMM or SGEMM: multiplication of a field array q of size [Nelements x Np] by an operator matrix D of size [Np x (Np x 7)].

2. Multiplication by geometric factors to apply the chain rule using a simple CUDA kernel, i.e., the matrix of size [Nelements x (Np x 7)] computed by cuBLAS is reduced to a matrix of size [Nelements x Np] by linear combination of the 7 derivatives.

Note: Nelements is the number of elements in the FEM mesh, and Np is the number of nodes per element, given by the size of Pascal's tetrahedron of depth N:

Np = (N+1)(N+2)(N+3)/6, where N is the polynomial degree.
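The two steps above can be sketched in NumPy (standing in for the cuBLAS GEMM and the streaming CUDA kernel). All sizes are illustrative, the geometric factors G are random stand-ins for the per-element chain-rule coefficients, and D is assumed to hold its 7 derivative blocks side by side:

```python
import numpy as np

N = 3                                   # polynomial degree (illustrative)
Np = (N + 1) * (N + 2) * (N + 3) // 6   # nodes per element (Pascal's tetrahedron)
Nelements = 1000                        # small illustrative mesh

q = np.random.rand(Nelements, Np)       # field values, one row per element
D = np.random.rand(Np, 7 * Np)          # operator matrix: 7 stacked derivative blocks
G = np.random.rand(Nelements, 7)        # per-element geometric factors

# Step 1: one large GEMM (what cuBLAS SGEMM/DGEMM performs on the GPU)
dq = q @ D                              # shape [Nelements, 7*Np]

# Step 2: chain-rule reduction -- combine the 7 derivatives at each node
# using the element's geometric factors (the streaming CUDA kernel)
dq = dq.reshape(Nelements, 7, Np)
Aq = np.einsum('eg,egn->en', G, dq)     # shape [Nelements, Np]
```

Note that Step 1 writes, and Step 2 re-reads, the full [Nelements x 7*Np] intermediate array; this extra traffic is central to the comparison that follows.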


Medium-sized mesh: We use a mesh with Nelements = 320,000 tetrahedra and compare against the best OCCA kernel (kernel 10) from this blog entry on a Titan V using FP32 arithmetic:

In this case, our custom kernel clearly "wins", especially for lower degrees. At the lowest orders the computation is memory bound, and using cuBLAS in the above manner requires us to write out 7 intermediate values per FEM node and then read them back in the chain rule kernel. The custom kernel computes the matrix product and chain rule without saving intermediate results, while achieving roofline performance through efficient blocking.
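The fused "matrix-free" idea can be sketched as follows, again in NumPy rather than the actual OCCA/CUDA kernel: the operator and the chain rule are applied element by element, so the [Nelements x 7*Np] intermediate array is never materialized. Names and sizes are illustrative:

```python
import numpy as np

N, Nelements = 3, 1000
Np = (N + 1) * (N + 2) * (N + 3) // 6

q = np.random.rand(Nelements, Np)
D = np.random.rand(Np, 7 * Np)          # 7 stacked Np x Np derivative blocks
G = np.random.rand(Nelements, 7)        # per-element geometric factors

Aq = np.zeros((Nelements, Np))
for e in range(Nelements):              # on the GPU: one thread block per element
    for g in range(7):
        Dg = D[:, g * Np:(g + 1) * Np]  # g-th derivative block (in shared memory/registers)
        # accumulate the chain rule locally -- nothing written to global memory
        # except the final result row
        Aq[e] += G[e, g] * (q[e] @ Dg)
```

The GPU kernel does the same arithmetic but blocks it so the accumulation happens in registers, which is what lets it reach the roofline at low orders.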


The results are somewhat more interesting for double precision (FP64).


While our custom approach wins for low degrees, it evens out with cuBLAS at N=7 and loses to cuBLAS at N=8. At the highest orders our blocking algorithm is not as flexible as that used by cuBLAS, and the time spent streaming intermediate results becomes insignificant compared to the time spent computing the matrix products.


Performance comparison in relation to problem size: By degrees of freedom we mean the number of elements in the mesh multiplied by the number of nodes per element. For example, for polynomial degree N=1 we have 4 nodes per element, so a mesh with 1M elements at N=1 gives 4M degrees of freedom.
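This count follows directly from the Np formula above; reproducing the example from the text:

```python
# Degrees of freedom as defined above: DOFs = Nelements * Np, where
# Np = (N+1)(N+2)(N+3)/6 is the number of nodes per tetrahedral element.
def dofs(Nelements, N):
    Np = (N + 1) * (N + 2) * (N + 3) // 6
    return Nelements * Np

assert dofs(1_000_000, 1) == 4_000_000   # 1M elements at N=1 -> 4M DOFs
```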


Why this comparison? Because it provides insight into the strong and weak scalability characteristics of each approach.


**Evaluation of Performance Dependence on Mesh Size**


In the above example we used a rather large mesh with 320,000 elements, chosen so that compute time would dominate start-up time for the kernels and cuBLAS calls. In this section we assess how well the cuBLAS and custom kernels perform over a range of mesh sizes, with element counts E chosen from [1,000; 2,000; 4,000; 8,000; 16,000; 32,000; 64,000; 128,000; 256,000; 512,000]. DOFS in the figures refers to DOFS = Np*E.


Performance on Titan V: The figures below show DOFs plotted against GFLOPS for cuBLAS, in single then double precision, on an NVIDIA Titan V card:

... and for our custom approach:


Looking at the figures, one can observe that the custom approach outperforms the cuBLAS-based approach for FP32, and for polynomial degrees 1 through 6 for FP64. For N=7 and N=8 both approaches are more or less equivalent.


Performance on Volta V100: What we expect to see: the conclusions (for FP64) should not change much from the Titan V results, but the numbers might.

What we see: the best performance in terms of GFLOPS is about 1 TFLOP/s higher than on the Titan V. The conclusion is the same as for the Titan V: the custom approach scales better, and the cuBLAS-based version starts to be competitive at N=7.


Node Throughputs on V100: Although it is satisfying to see the strong GFLOPS results above, it is actually more useful in practice to measure the throughput in FEM nodes per second. We denote this throughput as GNODES/s (i.e., giga-nodes per second).
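For concreteness, GNODES/s is simply total FEM nodes processed per second, scaled by 10^9. A minimal sketch, with made-up (not measured) mesh size and timing:

```python
# GNODES/s = (Nelements * Np) / elapsed_time / 1e9
def gnodes_per_sec(Nelements, Np, elapsed_seconds):
    return Nelements * Np / elapsed_seconds / 1e9

# e.g. a 512,000-element mesh at N=1 (Np = 4) processed in 0.1 ms:
rate = gnodes_per_sec(512_000, 4, 1e-4)   # ~20.5 GNODES/s
```

This metric is useful because, unlike GFLOP/s, it does not reward an approach for doing redundant arithmetic; it measures delivered work per FEM node.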


The cuBLAS code achieves from approximately 5 GNODES/s at N=1 down to 2 GNODES/s at N=8.

The node throughput of the custom kernels on the V100 ranges from 30 GNODES/s at N=1 to 2 GNODES/s at N=8, as shown in the following figure (note the different scale from the cuBLAS graph).

In both cases the throughput drops significantly at higher order. This is discouraging for high-order nodal finite element calculations in general.


Summary: Although the cuBLAS implementation exercise was simpler than designing, implementing, testing, modeling, and tuning bespoke kernels, it also induced significantly more data movement and required excessive temporary storage. The tuned OKL kernels deliver strong performance at all orders, in contrast to cuBLAS, which underperformed until N=7.

In addition, we emphasize that in our case we needed a two-step approach: a cuBLAS function followed by a chain rule kernel, which means we launch two kernels. We were also forced to allocate extra array space to hold the intermediate results. The conclusions of this study might have been very different if we were able to allocate the same amount of array space for both approaches and replace the custom bake-off approach with just one cuBLAS function.


In a future blog post we will discuss how the choice of basis for high-order tetrahedra can dramatically change the above results.
