
LOW DOWN ON HIGH-ORDER BLOG

 


Finite Element Stiffness Matrix Action: to precompute or not to precompute

Q: Does it make sense to partially assemble elemental stiffness matrices for affine tetrahedral finite elements when running on a Volta-class GPU?

Background: In the reviews of our recent paper on optimizing FEM operations for hexahedral elements we were asked why we do not assemble the matrices. That comment was the inspiration for this post, although here we discuss tetrahedral finite elements.

A: Recall that for each element we need to compute the action of the elemental stiffness matrix,

$(S^e q^e)_i = \sum_j \left( \int_{\mathcal{E}^e} \nabla \phi_i \cdot \nabla \phi_j \, \mathrm{d}\mathbf{x} \right) q^e_j,$

and the question we are asking is: is it more efficient to pre-compute the bilinear term for each element and store the entries of each elemental stiffness matrix? There are several considerations: #1. Each elemental stiffness matrix requires 8…
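To make the trade-off concrete, here is a minimal CUDA sketch of the "precompute" option, assuming dense Np x Np elemental matrices stored contiguously in column-major order. The names (Nel, Np, S, q, Sq) are illustrative, not taken from the post:

    // Sketch (not the post's code): apply precomputed elemental stiffness
    // matrices. One thread block per element, one thread per output node.
    // Assumed layout: S is Nel blocks of Np*Np (column-major), q and Sq
    // are Nel blocks of Np entries each.
    __global__ void stiffnessActionPrecomputed(int Nel, int Np,
                                               const double *S,
                                               const double *q,
                                               double *Sq) {
      int e = blockIdx.x;   // element index
      int i = threadIdx.x;  // local row (node) index
      if (e < Nel && i < Np) {
        const double *Se = S + (size_t)e * Np * Np; // this element's matrix
        const double *qe = q + (size_t)e * Np;      // this element's dofs
        double r = 0.0;
        for (int j = 0; j < Np; ++j)
          r += Se[i + j * Np] * qe[j];              // column-major S^e
        Sq[(size_t)e * Np + i] = r;
      }
    }

Launched with one block per element and at least Np threads per block, every thread streams a full row of S^e from global memory; that stored-matrix traffic is exactly what must be weighed against recomputing the bilinear term on the fly.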

Finite Element Stiffness Matrix Action: to BLAS or not to BLAS, that is the question.

BLAS (Basic Linear Algebra Subprograms) is a specification of common, basic linear algebra routines. BLAS functions have been implemented and optimized for GPUs and packaged in libraries (e.g., cuBLAS, CULA, and (batched) MAGMA). A significant part of the work performed by our finite element codes consists of operations that can be expressed as BLAS routines (e.g., dot products, vector updates, matrix-vector, and matrix-matrix multiplications). We usually develop custom kernels that combine several BLAS-type operations to reduce memory traffic or memory usage. For instance, we often implement "matrix-free" operations, where we apply the action of a matrix on a vector without ever assembling or storing the matrix…
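As a point of reference, the batched elemental product from the previous post maps directly onto a stock cuBLAS call. The sketch below treats each q^e as an Np x 1 matrix and reuses the illustrative layout from the sketch above; it is an assumption for comparison, not the group's code:

    #include <cublas_v2.h>

    // Sketch: batched elemental products S^e q^e via one strided-batched
    // GEMM, one (Np x Np) * (Np x 1) multiply per element.
    void stiffnessActionBLAS(cublasHandle_t handle, int Nel, int Np,
                             const double *d_S, const double *d_q,
                             double *d_Sq) {
      const double one = 1.0, zero = 0.0;
      cublasDgemmStridedBatched(handle,
          CUBLAS_OP_N, CUBLAS_OP_N,
          Np, 1, Np,                      // m, n, k
          &one,
          d_S,  Np, (long long)Np * Np,   // A: elemental matrices
          d_q,  Np, (long long)Np,        // B: elemental vectors
          &zero,
          d_Sq, Np, (long long)Np,        // C: elemental results
          Nel);                           // one GEMM per element
    }

A custom fused kernel can beat a sequence of such calls because it keeps intermediate results in registers or shared memory, whereas separate BLAS calls round-trip every intermediate through global memory.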

Finite Element Stiffness Matrix Action: monolithic kernel optimization on Titan V

In this post we demonstrate how to optimize the performance of a specific finite element operation expressed as a GPU kernel. One might ask: why are we doing this? Why do we care so much about optimizing performance? After carefully reading this post, it should be clear that there is a substantial difference in performance between basic GPU code and highly optimized GPU code. As our example we use a simple, yet demanding (high arithmetic intensity) matrix-vector kernel arising in FEM. The piece of code in question is executed multiple times per time step in time-dependent flow calculations that can require hundreds of thousands of time steps, hence we would like to make it as fast as possible.
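For flavor, here is one optimization step of the kind such a tuning exercise typically walks through: staging the element vector in shared memory so each entry of q^e is read from global memory once per block instead of once per row. The compile-time node count p_Np = 20 (cubic tetrahedra) and all names are assumptions for illustration, not the post's final kernel:

    // Sketch: shared-memory staging of the element vector q^e.
    #define p_Np 20   // nodes per cubic tetrahedron (illustrative choice)

    __global__ void stiffnessActionShared(int Nel,
                                          const double * __restrict__ S,
                                          const double * __restrict__ q,
                                          double * __restrict__ Sq) {
      __shared__ double s_q[p_Np];
      int e = blockIdx.x;               // uniform across the block
      int i = threadIdx.x;
      if (e >= Nel) return;

      if (i < p_Np)                     // cooperatively load q^e once
        s_q[i] = q[(size_t)e * p_Np + i];
      __syncthreads();

      if (i < p_Np) {
        const double *Se = S + (size_t)e * p_Np * p_Np;
        double r = 0.0;
    #pragma unroll
        for (int j = 0; j < p_Np; ++j)
          r += Se[i + j * p_Np] * s_q[j]; // column-major elemental matrix
        Sq[(size_t)e * p_Np + i] = r;
      }
    }

Making the node count a compile-time constant lets the compiler fully unroll the inner loop and keep the running sum in registers, which is usually one of the first wins when tuning kernels of this shape.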

 



©2018 by the Parallel Numerical Algorithms Research Group @VT.