2020 GPU Hackathon (Sheffield)

We are very happy to participate with a first version of ExaClaw, based upon ExaHyPE 2/Peano 4, in the 2020 Sheffield GPU Hackathon. We summarised our major lessons learned in a brief report.

Vision and work programme

We follow two goals for this hackathon:

  1. We want to understand how we can enable generic ExaHyPE user code (written in C) to run on GPUs. For this, we have taken a solver for the 2d Euler equations and manually made it run with OpenMP 5. This required quite a lot of manual inlining, the removal of lambdas, and the elimination of virtual function calls. One of the goals of this workshop is to understand how much of this manual work could eventually be done by a compiler. We will use the lessons learned to design our own ExaHyPE 2 precompiler which does all the code conversions that will not be taken up by future tool generations.
  2. We want to run at least one feasibility study on how we can combine task-based parallelism on the main node with (multi-)GPU usage. Is it possible to use a GPU with multiple tasks accessing it in a “random” fashion? Can we derive first guidelines on how to load-balance between the devices?

Our agenda thus revolves pretty much around feasibility studies. We are neither focusing on performance engineering at this stage nor on production runs.

Code base and toolchain

We will stick to the 2d and 3d Euler equations for the hackathon. In the best case, we plan to run the LOH.1 benchmark throughout the hackathon. For all tests, we run only Finite Volume schemes on a regular mesh to keep the numerics simple. MPI + X + GPGPU would be nice, but the primary vision is to get X + GPGPU up and running.

Durham GPU workstations (local)

We use the following packages:

gcc v10.1.1-1.fc32
gcc-offload-nvptx.x86_64 v10.1.1-1.fc32 (Offloading compiler to NVPTX)
libgomp-offload-nvptx.x86_64 v10.1.1-1.fc32 (GCC OpenMP v4.5 plugin for offloading to NVPTX)
NVIDIA CUDA Toolkit 10.2

NVIDIA DGX cluster (via Okta)

We basically follow our guidebook installation instructions. The important thing is to use the compute nodes, as only compute nodes have access to the whole module environment required:

sft ssh raplab-hackathon
git clone --branch p4 https://gitlab.lrz.de/hpcsoftware/Peano.git
srun --nodes=1 --partition=batch --pty /bin/bash

module load Bundle/gnu/10.1.0-gpu

cd Peano
libtoolize; aclocal; autoconf; autoheader; cp src/config.h.in .; automake --add-missing
./configure --enable-exahype --with-multithreading=omp --with-nvidia CXXFLAGS=-DUseLogService=NVTXLogger
make clean
make -j32

The snippet installs our latest code version without the GPU parts. It is the master branch of Peano’s fourth generation (hence p4), and the ultimate goal is to integrate the GPGPU parts (see below) into this master. The master is ahead of the GPGPU branch w.r.t. functionality and MPI support, so switching back and forth might be required.

Peano 4/ExaHyPE 2 rely heavily on a Python front-end which in turn requires libraries such as numpy or Jinja2. These are not available by default, so I recommend using a virtual environment and installing the packages in user space:

module load gcc/10.1.0/python/3.7.7
python3 -m venv $HOME/peano-python-api
source $HOME/peano-python-api/bin/activate
pip install jinja2
pip install numpy

Once all of these steps have been done, you can skip the environment creation and jump straight into the Python environment after you’ve logged in:

srun --nodes=1 --partition=batch --pty /bin/bash

module load Bundle/gnu/10.1.0-gpu
source $HOME/peano-python-api/bin/activate

Test code base

For all of our tests throughout the hackathon, we run the simple Euler 2d setup. Please change into /python/examples/exahype2/euler and run the following steps:

export PYTHONPATH=../../..
python3 example-scripts/finitevolumes-with-ExaHyPE2-benchmark.py

You will (hopefully) get a usage message which tells you how to kick off a series of measurements. Once you pass in the arguments and rerun the script, you should get a peano4 executable which you can run on your compute node:

python3 example-scripts/finitevolumes-with-ExaHyPE2-benchmark.py --trees-per-core=0.7 --h 0.01

You then can run the code via

./peano4 --threads 8

Peano 4 can set the number of threads manually from within the code. With four threads, for example, I got a runtime of 258.8s. With 8 as argument, the code runs for 305.6s. So something is not working. Indeed, validating with one thread yields 273.7s. So the code is not actually multithreading; it merely suffers from domain decomposition overhead if we tell it to use more threads.

The explanation is not surprising and is actually reported by the code: the OpenMP thread level is set to an invalid value. I can alter it via OMP_NUM_THREADS=48, but the CPU mask seems to remain unaltered, as I still do not see any speedup. I only get around 200% core usage, which is (maybe) due to some hyperthreading effect. I definitely do not use all cores. So it is important to tell SLURM right from the start how many cores you want to use (Peano might, however, be able to reduce this number via the command line argument):

srun --nodes=1 --cpus-per-task=12 --partition=batch --pty /bin/bash

Bigger problem sizes can be constructed by passing smaller h values into the Python script. With the argument --trees-per-core, you can alter the ratio of cores that the code tries to occupy with computational subdomains to cores that are used only for processing tasks arising from these domains.


To enable the custom profiling, you need CUDA (11), you have to link against the trace version of the code, and you need a build that links against the NVTX library. Otherwise, the traces will be difficult to digest. So, first ensure that configure is passed the arguments that tell Peano to build against NVTX:

./configure --enable-exahype --with-multithreading=omp --with-nvidia CXXFLAGS="-DUseLogService=NVTXLogger" LDFLAGS="-L/mnt/shared/sw-hackathons/cuda-sdk/cuda-10.1/lib64 -lnvToolsExt"

Peano always builds multiple versions of its core libraries when you compile: a release version, a debug version, and a trace version. With the --with-nvidia option, the trace version’s traces are not written into a file but piped into the NVTX library. I therefore edit the Python script and set build_mode to Trace, so we link against the trace version of the library. A simple

module load cuda/11.0.2
nsys profile -o timeline --trace=nvtx ./peano4

now should give you a meaningful timeline (I don’t know why nsys is not part of CUDA 10 and hope there’s no incompatibility between the two CUDA modules/tool sets). However, Peano implements its own trace filtering. To ensure that your code writes out trace information, open the file exahype.log-filter and make sure the trace entries of interest are set to whitelist entries. If there is no such file, all trace info is masked out. I recommend setting all trace information to black besides the entries on enclave tasks:

trace tarch -1 black
trace peano4 -1 black
trace examples -1 black
trace exahype2 -1 black
trace toolbox -1 black

trace exahype2::EnclaveTask -1 white
trace exahype2::fv          -1 black

My only really useful traces stem from a regular grid experiment with a mesh size of 0.03. Coarser meshes lack any interesting dynamics; finer meshes run for ages.

OpenMP 5 + GPU Version

Before you start with GPUs, request a GPU when you log into the compute node:

srun --nodes=1 --cpus-per-task=12 --partition=batch --gres=gpu:1 --pty /bin/bash

Furthermore, load CUDA 10:

module load Bundle/gnu/10.1.0-gpu
module load cuda/10.1

I faced severe issues when I still had CUDA 11 loaded (even though you need that one for the performance analysis), so ensure you purge the environment before you build. Next, open your spec file and ensure that your solver is the enclave solver. Furthermore, set the GPU flag to True:

    use_gpu = True

Rerun Python and execute the code. Before you do so, ensure that your build is properly configured:

./configure --enable-exahype --with-multithreading=omp --with-nvidia CXXFLAGS="-DUseLogService=NVTXLogger -foffload=-lm -fno-fast-math -fno-associative-math" LDFLAGS="-L/mnt/shared/sw-hackathons/cuda-sdk/cuda-10.1/lib64 -lnvToolsExt" 

This is an interesting set of options. I tried a couple more w.r.t. offloading. Some of them seem to be required, others make the code fail:

-fno-lto (linker, LDFLAGS): Don’t use it. It disables LTO, and without LTO nothing goes to the GPU, even if the code uses offloading.
-foffload=nvptx-none (compiler): Should not be required, as this should be part of the default compiler configuration.
-fno-exceptions (compiler): Does not seem to make a difference.
-fno-devirtualize (compiler): Does not seem to make a difference.
-foffload=-lm -fno-fast-math -fno-associative-math (linker): Build against math routines fitting the GPU. I need these flags, as the flux and eigenvalues use things like a square root.
-fopt-info-optimized-omp: Not clear how this works with OpenMP offloading. Skip it.

Attention: I repeatedly had issues where nothing was offloaded to the GPU and I recompiled again and again. It turned out that I had been logged out and then forgot to call srun with a request for a GPU!

Once you have profiled your code (ensure you add openmp and cuda as further profile targets), you should get a meaningful timeline.

Code changes/design

Peano phrases the operations per cell as tasks. It distinguishes these tasks according to their position: there are skeleton tasks, which are latency-critical. These are tasks along MPI boundaries or along refinement transitions. We do not spawn them as actual tasks but process them right away. The remaining cells in the domain yield enclave tasks. These we spawn as real tasks, which are then deployed to background OpenMP or TBB threads/tasks.

The enclave tasks can be processed in any order, and they are not time-critical. Consequently, we plan to offload them to the GPGPU. Each task, whether enclave or skeleton, invokes a computational kernel. In the present hackathon, the kernel will be the Finite Volume computation. When we invoke the kernel, we know whether we come from an enclave or a skeleton task. The vision thus is to use a simple if over the skeleton predicate which determines whether a task goes to the GPU or not.

Once we benchmark our code, it becomes apparent that we suffer from poor task deployment with standard OpenMP.