Q: does it make sense to use partially assembled elemental stiffness matrices for affine tetrahedral finite elements when running on a Volta-class GPU?
Background: In the reviews of our recent paper on optimizing FEM operations for hexahedral elements we were asked why we do not assemble the matrices. That comment inspired this post, although here we discuss tetrahedral rather than hexahedral finite elements.
A: Recall that we need to compute the following:
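Concretely, for an affine tetrahedral element D^e the stiffness action takes the form below. This is a sketch: the symbols S^{rs} and G^e_{rs} are illustrative notation for the reference stiffness matrices and elemental geometric factors discussed in the points that follow, not necessarily the notation of the original figure.

```latex
(S^e q)_i \;=\; \sum_{j=1}^{N_p} A^e_{ij}\, q_j,
\qquad
A^e_{ij} \;=\; \int_{D^e} \nabla \phi_i \cdot \nabla \phi_j \, dx
\;=\; \sum_{1 \le r \le s \le 3} G^e_{rs}\, S^{rs}_{ij},
```

where the S^{rs} are the six distinct reference stiffness matrices and the G^e_{rs} are geometric factors that are constant on each affine element.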
and the question we are asking is: is it more efficient to pre-compute the bilinear term for each element and store the entries of each elemental stiffness matrix?
There are several considerations:
#1. Each elemental stiffness matrix requires 8 x Np x Np bytes when stored in double precision. Here Np is the number of basis functions. For a standard tetrahedral element of degree N this is given by
Np = (N+1) x (N+2) x (N+3)/6
For N=1, Np=4, and for N=8, Np=165. The storage for these two degrees would be 128 B and about 218 KB per element respectively. The V100 has 16 GB of device memory, thus we could store about 125M linear (N=1) element matrices, dropping to about 73K degree-8 elements on a single GPU if we did not have any other storage requirements.
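The arithmetic in #1 is easy to verify; a quick sketch (the 16 GB figure is the V100's device memory, ignoring all other allocations):

```python
# Check the storage arithmetic in #1 (double precision, 8 bytes per entry).
def Np(N):
    # number of basis functions for a degree-N tetrahedral element
    return (N + 1) * (N + 2) * (N + 3) // 6

def matrix_bytes(N):
    # one pre-assembled elemental stiffness matrix
    return 8 * Np(N) ** 2

V100_BYTES = 16e9  # 16 GB of device memory on a V100

print(Np(1), matrix_bytes(1))              # 4, 128 bytes
print(Np(8), matrix_bytes(8))              # 165, 217800 bytes (~218 KB)
print(int(V100_BYTES // matrix_bytes(1)))  # 125,000,000 linear elements
print(int(V100_BYTES // matrix_bytes(8)))  # ~73,000 degree-8 elements
```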
#2. For the matrix-free version (both the cuBLAS and non-cuBLAS variants) we effectively partially assemble and apply these matrices on the fly using reference stiffness matrices and elemental geometric factors. This requires approximately 14 x Np x Np floating point operations per element, but since we are able to cache the reference stiffness matrices (in L1 at low order and in L2 at higher order) this can be done relatively efficiently. In particular, Kernels 8, 9, and 10 presented in this blog post use the cached data very efficiently.
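To make #2 concrete, here is a hedged plain-Python sketch of the on-the-fly application (the real kernels operate on device arrays and the names here are illustrative): the elemental matrix is never formed; its action is computed as a geometric-factor-weighted sum of six reference stiffness mat-vecs, which accounts for roughly the 14 x Np x Np flops quoted above.

```python
def matrix_free_action(Sref, G, q):
    # Sref: list of 6 reference stiffness matrices, each Np x Np
    #       (one per distinct r,s pair: rr, rs, rt, ss, st, tt)
    # G:    6 geometric factors for this element (constant on an affine tet)
    # q:    input vector of length Np for this element
    Np_ = len(q)
    Aq = [0.0] * Np_
    for g in range(6):                       # six distinct factors by symmetry
        for i in range(Np_):
            acc = 0.0
            for j in range(Np_):
                acc += Sref[g][i][j] * q[j]  # 2*Np^2 flops per reference matrix
            Aq[i] += G[g] * acc              # scale and accumulate
    return Aq
```

By construction this matches applying the explicitly pre-assembled matrix sum(G[g] * Sref[g]), without ever storing it.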
#3. We stress that while the partially assembled stiffness matrix approach only requires 2 x Np x Np flops for the matrix-vector product of each element, there is limited opportunity to exploit the L1 and L2 caches or prefetching to registers.
#4. As noted in our most recent blog entry, the efficiency of the GPU depends strongly on the number of elements being processed. At high order the maximum number of elements that fit in device memory is relatively low, and this could potentially impact performance.
To explore the relative importance of these issues we created an OCCA-based kernel that exploits optimized SIMD cramming and loop unrolling, and used it to compare the throughput of the pre-built matrix approach with that of the on-the-fly approach we highlighted in a previous blog entry.
#5. A simple but effective performance model for these kernels assumes they are strictly memory bound, since each element requires a unique matrix and performs a matrix-vector product that cannot benefit from reusing cached data.
An estimate of the throughput in nodes per second is then bounded by:
GNODES/s = (device bandwidth in GB/s)/(sizeof(dfloat) x (Np + 2)),
since each element streams Np x Np matrix entries and reads and writes Np vector entries, i.e. Np + 2 values per node.
#6. There are relatively few optimization options for implementing and tuning kernels that use pre-assembled matrices, since the operation reduces to streaming the matrices from device memory and caching the vector in shared memory. In the following kernel we illustrate the main optimization, SIMD cramming, which is important at low order: we pack several elements into each thread-block so that the block size is close to a multiple of 32 threads:
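The block-size choice behind SIMD cramming can be sketched in plain Python (the helper name, the 32-thread warp size, and the 1024-thread block cap are illustrative assumptions, not the actual OCCA kernel):

```python
def elements_per_block(Np, warp=32, max_threads=1024):
    # Choose how many elements to pack ("cram") into one thread-block so
    # that the last warp has as few idle threads as possible.
    best_nb, best_waste = 1, warp
    for nb in range(1, max_threads // Np + 1):
        waste = (-nb * Np) % warp  # idle threads after rounding up to full warps
        if waste < best_waste:
            best_nb, best_waste = nb, waste
    return best_nb
```

For N=1 (Np = 4) this picks 8 elements per block, i.e. exactly 32 threads; at high order a single element already spans several warps and cramming matters much less.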
NVIDIA V100 Results:
In the following figure we show FEM node throughput rates measured on an NVIDIA V100. First for the partially assembled stiffness matrix version:
Our model for the NVIDIA V100 (based on an achievable bandwidth of approximately 800 GB/s) predicts the following node throughputs:
N | 1 2 3 4 5 6 7 8
Est. GNODES/s | 16.7 8.3 4.5 2.7 1.7 1.2 0.8 0.6
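This table follows directly from the model in #5; a quick check (800 GB/s is the assumed achievable bandwidth, sizeof(dfloat) = 8 for double precision):

```python
# Reproduce the predicted throughput table from the model in #5.
BW = 800e9  # assumed achievable V100 bandwidth, bytes/s

def Np(N):
    return (N + 1) * (N + 2) * (N + 3) // 6

for N in range(1, 9):
    gnodes = BW / (8 * (Np(N) + 2)) / 1e9  # nodes/s in billions
    print(f"N={N}: {gnodes:.1f} GNODES/s")
```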
These are good predictors of the large mesh performance shown in the above graph.
We further recall the throughput results for our custom kernels that compute the action of the local stiffness matrices using reference element matrices and the chain rule on the fly:
We note that the achieved throughput of the partially assembled approach lags the matrix-free approach by a significant factor.
Conclusion: although the simplicity of the pre-assembled kernel is appealing, it is sub-optimal on three counts:
1. Throughput is limited by streaming the pre-assembled matrices.
2. The matrices require substantial storage (which limits the maximal number of elements per GPU).
3. The custom kernel with on-the-fly matrix partial assembly is faster than the partially assembled approach in this specific instance.
Caveat: There is one possibility for improvement: in some applications it may be necessary to operate on more than one vector, for instance in an incompressible flow code where temporal splitting requires the solution of three elliptic problems (one for each velocity component). In this instance the element matrices only need to be streamed once while being applied to all three vectors, so the three components will cost nearly the same as one component.
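This caveat can be quantified with the same streaming model: with m simultaneous vectors the element matrix is streamed once but applied m times, so the per-vector traffic drops from Np + 2 to Np/m + 2 values per node. A sketch (the function name and the 800 GB/s figure are illustrative assumptions):

```python
def gnodes_per_s(N, m, bandwidth=800e9):
    # Streaming model of #5 extended to m simultaneous right-hand-side
    # vectors: the matrix is read once per element, each vector is read
    # and written once. An illustrative model, not a measured result.
    Np = (N + 1) * (N + 2) * (N + 3) // 6
    return bandwidth / (8 * (Np / m + 2)) / 1e9
```

For N=8 this model predicts roughly 2.9x higher per-vector throughput with m = 3 than with m = 1, consistent with three velocity components costing nearly the same as one.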