BLAS (Basic Linear Algebra Subprograms) is a specification for performing multiple common, basic linear algebra routines. BLAS functions have been implemented and optimized for GPUs, and packaged in libraries (i.e., cuBLAS, CULA, and (batched) MAGMA) .
A significant part of the work performed by our finite element codes consists of operations that can be expressed as BLAS routines (i.e., dot products, vector updates, matrix-vector, and matrix-matrix multiplications). We usually develop custom kernels that combine several BLAS-type operations to reduce memory traffic or memory usage. For instance we often implement "matrix-free" operations where we apply the action of a matrix on a vector without storing the matrix explicitly.
Context: In a recent post we optimized a kernel that performs stiffness matrix action on tetrahedral elements. Alternatively, this operation can be expressed as matrix-matrix product plus a streaming kernel to apply the chain rule that relates derivatives computed with respect to local elemental coordinates to derivatives in physical coordinates. In this post we analyze how our approach compares to cuBLAS.
Implementation: The matrix action is executed as two operations:
Matrix-matrix product using DGEMM or SGEMM. Matrix multiplication of a field array q of size [Nelements x Np] with an operator matrix D of size [ Np x (Np x 7) ]:
2. Multiplication by geometric factors to apply the chain rule using a simple CUDA kernel, i.e., a matrix of size [Nelements x ( Np x 7)] is reduced to a matrix of size [Nelements x Np] by linear combination of the 7 derivatives computed using cuBLAS:
Note: Nelements = Number of elements in a FEM mesh and Np is the num of nodes per element, determined by this formula for the size of Pascal's tetrahedron of depth N
Np = (N+1)(N+2)(N+3)/6, where N is the polynomial degree.
Medium-sized mesh: We use a mesh with Nelements = 320,000 tetrahedra and we compare it with the best OCCA kernel (kernel 10) from this blog entry on a Titan V using FP32 arithmetic:
In this case, our custom kernel clearly "wins", especially for lower degrees. At the lowest orders the computation is memory bound and using cuBLAS in the above manner requires us to write out 7 intermediate values per FEM node and then read them in again in the chain rule kernel. The custom version of the kernel compute the matrix product and chain rule without saving intermediate results, while achieving roofline performance through efficient blocking.
The results are somewhat more interesting for double precision (FP64).
While our custom approach wins for low degrees, it evens out with cuBLAS for N=7 and looses with cuBLAS for N=8. At the highest order our blocking algorithm is not as flexible as that used by cuBLAS and the time spent in streaming intermediate results is insignificant compared to the time spent in computing the matrix products.
Performance comparison in relation to problem size: By degrees of freedom we understand the number of elements in the mesh multiplied by the number of nodes in the element. For example, for a polynomial degree N=1, we have 4 nodes per element. If we use a mesh with 1M elements and set polynomial degree to N=1, we get 4M degrees of freedom.
Why this comparison? Because it provides an insight into the strong and weak scalability characteristics of each approach.
Evaluation of Performance Dependence on Mesh Size
In the above example we used a rather large mesh with 320,000 elements. This was chosen so that compute time would dominate start up time for the kernels and cuBLAS calls. In this section we assess how well the the cuBLAS and custom kernels perform for a range of mesh sizes, with element counts E chosen from [1,000 2,000 4,000 8,000 16,000 32,000 64,000 128,000 256,000 512,000]. DOES in the figures refers to DOFS=Np*E.
Performance on Titan V: The figures below show DOFs plotted against GFLOPS for cuBLAS in single then double precision on an NVIDIA Titan V card:
... and for our custom approach:
Looking at the figures, one should be able to observe that the custom approach outperforms cuBLAS-based approach for FP32 and for polynomial degrees 1 through 6 for FP64. For N=7 and N=8 both approaches are more less equivalent.
Performance on Volta V100: What we do expect to see: the conclusions (for FP64) should not change much from the Titan V results but the numbers might.
What do we see: the best performance in terms of GFLOPS is better by about 1TFLOP/s compared to Titan V. The conclusion is the same as for TitanV - custom approach scales better and cuBLAS-based version starts to be competitive at N=7.
Node Throughputs on V100: Although it is satisfying to see the above strong GFLOPS/s results, it is actually more useful in practice to measure the throughput in FEM nodes per second. We denote this throughput as GNODES/s (i.e. giga-nodes per second).
The cuBLAS code achieves from approximately 5 GNODES/s at N=1 to 2 GNODES/s at N=8.
The node throughput of the custom kernels on the V100 range from 30GNODES/s at N=1 to 2GNODES/s at N=8 as shown in the following figure (noting the different scale to the cuBLAS graph)
In both cases the throughput drops significantly at higher order. This is discouraging for high-order nodal finite element calculations in general.
Summary: Although the cuBLAS implementation exercise was simpler than designing, implementing, testing, modeling, and tuning bespoke kernels it also induced significantly more data movement and required excessive temporary data storage. Finally, the tuned OKL kernels deliver strong performance at all orders in contrast to cuBLAS which underperformed until N=7.
In addition, we emphasize that in our case, we needed a two-step approach: cuBLAS function followed by chain rule kernel, which means we launch two kernels. In addition, we were forced to allocate more array space to hold the intermediate results. The conclusions of this study might had been very different if we were able to allocate the same amount of array space for both approaches and replace the custom bake-off approach with just one cuBLAS function.
In a future blog post we will discuss how the choice of basis for high-order tetrahedra can dramatically change the above results.