Vision and work programme
We follow two goals for this hackathon:
- We want to understand how we can enable generic ExaHyPE user code (written in C++) to run on GPUs. For this, we have taken a solver for the 2d Euler equations and manually made it run with OpenMP 5 offloading. This required quite a lot of manual inlining, the removal of lambdas, and the elimination of virtual function calls. One goal of this workshop is to understand how much of this manual work could eventually be done by a compiler. We will use the lessons learned to design our own ExaHyPE 2 precompiler which performs all the code conversions that will not be taken up by future tool generations.
- We want to run at least one feasibility study on how to combine task-based parallelism on the main node with (multi-)GPU usage. Is it possible to use a GPU with multiple tasks accessing it in a "random" fashion? Can we derive first guidelines on how to load-balance between the devices?
So our agenda revolves around feasibility studies. At this stage, we focus neither on performance engineering nor on production runs.
Code base and toolchain
We will stick to the 2d and 3d Euler equations for the hackathon. In the best case, we plan to run the LOH.1 benchmark throughout the hackathon. For all tests, we run only Finite Volume schemes on a regular mesh to keep the numerics simple. MPI + X + GPGPU would be nice, but the primary goal is to get X + GPGPU up and running.
Durham GPU workstations (local)
We use the following packages:
gcc v10.1.1-1.fc32
gcc-offload-nvptx.x86_64 v10.1.1-1.fc32 (offloading compiler for NVPTX)
libgomp-offload-nvptx.x86_64 v10.1.1-1.fc32 (GCC OpenMP v4.5 plugin for offloading to NVPTX)
NVIDIA CUDA Toolkit 10.2
NVIDIA DGX cluster (via Okta)
We basically follow our guidebook installation instructions. The important thing is to use the compute nodes, as only compute nodes have access to the whole module environment required:
sft ssh raplab-hackathon
git clone --branch p4 https://gitlab.lrz.de/hpcsoftware/Peano.git
srun --nodes=1 --partition=batch --pty /bin/bash
module load Bundle/gnu/10.1.0-gpu
cd Peano
libtoolize; aclocal; autoconf; autoheader; cp src/config.h.in .; automake --add-missing
./configure --enable-exahype --with-multithreading=omp --with-nvidia CXXFLAGS=-DUseLogService=NVTXLogger
make clean
make -j32
The snippet installs our latest code version without the GPU parts. It is the master branch of Peano's fourth generation (hence p4), and the ultimate goal is to integrate the GPGPU parts (see below) into this master. The master is ahead of the GPGPU branch w.r.t. functionality and MPI support, so switching back and forth might be required.
Peano 4/ExaHyPE 2 rely heavily on a Python front-end which in turn requires libraries such as numpy or Jinja2. These are not available by default, so I recommend using a virtual environment and installing the packages in user space:
module load gcc/10.1.0/python/3.7.7
python3 -m venv $HOME/peano-python-api
source $HOME/peano-python-api/bin/activate
pip install jinja2
pip install numpy
Once all of these steps have been done, you can skip the environment creation and jump straight into the Python environment after you’ve logged in:
srun --nodes=1 --partition=batch --pty /bin/bash
module load Bundle/gnu/10.1.0-gpu
source $HOME/peano-python-api/bin/activate
Test code base
For all of our tests throughout the hackathon, we run the simple Euler 2d setup. So please change into /python/examples/exahype2/euler and run the following commands:
export PYTHONPATH=../../..
python3 example-scripts/finitevolumes-with-ExaHyPE2-benchmark.py
You will (hopefully) get a usage message which tells you how to kick off a series of measurements. Once you pass in the arguments and rerun the script, you should get a peano4 executable which you can run on your compute node:
python3 example-scripts/finitevolumes-with-ExaHyPE2-benchmark.py --trees-per-core=0.7 --h 0.01
You then can run the code via
./peano4 --threads 8
Peano 4 can set the number of threads manually from within the code. I tried, for example, four threads and got a runtime of 258.8s. With 8 as argument, the code runs for 305.6s. So something is not working. Indeed, we validate with one thread and get 273.7s. The code is thus not multithreading but suffers from domain decomposition overhead if we tell it to use more threads.
The explanation is not surprising and is actually reported by the code: the OpenMP thread level is set to an invalid value. I can alter this via OMP_NUM_THREADS=48, but the CPU mask seems to remain unaltered, as I still do not see any speedup. I only get around 200% core usage, which is (maybe) due to hyperthreading. I definitely don't use all cores. So it is important to tell SLURM right from the start how many cores you want to use (Peano might, however, be able to reduce this number via the command line argument):
srun --nodes=1 --cpus-per-task=12 --partition=batch --pty /bin/bash
Bigger problem sizes can be constructed by passing smaller h values into the Python script. With the argument --trees-per-core, you can alter the ratio of cores that the code tries to occupy with computational subdomains to cores that are used only for processing tasks arising from these domains.
To enable the custom profiling, you need CUDA 11, you have to link against the trace version of the code, and you need a build that links against the NVTX library. Otherwise, the traces will be difficult to digest. So, first ensure that your configure call is passed the arguments that tell Peano to build against NVTX:
./configure --enable-exahype --with-multithreading=omp --with-nvidia CXXFLAGS="-DUseLogService=NVTXLogger" LDFLAGS="-L/mnt/shared/sw-hackathons/cuda-sdk/cuda-10.1/lib64 -lnvToolsExt"
Peano always builds multiple versions of the core libraries when you compile: a release version, a debug version, and a trace version. With the --with-nvidia option, the trace version's output is not written into a file but piped into the NVTX library. I thus edit the Python script and set build_mode to Trace, so that we link against the trace version of the library. A simple
module load cuda/11.0.2
nsys profile -o timeline --trace=nvtx ./peano4
now should give you a meaningful timeline (I don't know why nsys is not part of CUDA 10 and hope there's no incompatibility between the two CUDA modules/tool sets). However, Peano implements its own trace filtering. To ensure that your code does write trace info, open the file exahype.log-filter and ensure that the trace entries of interest are set to whitelist entries. If there's no such file, then all trace info is masked out. I recommend that you set all trace information to black besides the entries on enclave tasks:
trace tarch -1 black
trace peano4 -1 black
trace examples -1 black
trace exahype2 -1 black
trace toolbox -1 black
trace exahype2::EnclaveTask -1 white
trace exahype2::fv -1 black
My only really useful traces stem from a regular grid experiment with a mesh size of 0.03. Coarser meshes lack any interesting dynamics; finer meshes run for ages.
OpenMP 5 + GPU Version
Before you start with GPUs, request a GPU when you log into the compute node:
srun --nodes=1 --cpus-per-task=12 --partition=batch --gres=gpu:1 --pty /bin/bash
Furthermore, load CUDA 10:
module load Bundle/gnu/10.1.0-gpu
module load cuda/10.1
I faced severe issues when I still had CUDA 11 loaded (even though you need that one for the performance analysis), so ensure you purge the environment before you build. Next, open your spec file and ensure that your solver is the enclave solver. Furthermore, set the GPU flag to True:
project.add_solver( exahype2.solvers.GenericRusanovFVFixedTimeStepSizeWithEnclaves(
  ...,
  use_gpu = True
))
Rerun Python and execute the code. Before you do so, please ensure that your build is properly configured:
./configure --enable-exahype --with-multithreading=omp --with-nvidia CXXFLAGS="-DUseLogService=NVTXLogger -foffload=-lm -fno-fast-math -fno-associative-math" LDFLAGS="-L/mnt/shared/sw-hackathons/cuda-sdk/cuda-10.1/lib64 -lnvToolsExt"
This is an interesting set of options. I tried a couple more w.r.t. offloading. Some of them seem to be required; others make the code fail:
|Flag|Where|Comment|
|-fno-lto|Linker (LDFLAGS)|Do not use LTO. Without this flag, nothing goes to the GPU, even if the code uses offloading.|
|-foffload=nvptx-none|Compiler|Should not be required, as this should be part of the default compiler configuration.|
|-fno-exceptions|Compiler|Does not seem to make a difference.|
|-fno-devirtualize|Compiler|Does not seem to make a difference.|
|Build against math routines fitting the GPU|Linker|I need these flags, as the flux and eigenvalues use things like a square root.|
|-fopt-info-optimized-omp|Compiler|Not clear how this works with OpenMP offloading. Skip it.|
Attention: I repeatedly had issues where nothing was offloaded to the GPU, and I recompiled over and over again. It turned out that I had been logged out and had forgotten to call srun with a request for a GPU!
Once you have profiled your code (ensure you add openmp and cuda as further profile targets), you should get a meaningful timeline.
Peano phrases the operations per cell as tasks. It distinguishes these tasks according to their position: there are skeleton tasks, which are latency-critical. These are tasks along MPI boundaries or along refinement transitions. We do not spawn them as actual tasks but process them right away. The remaining cells in the domain yield enclave tasks. These are tasks that we spawn as real tasks, which are then deployed to background OpenMP or TBB threads.
The enclave tasks can be processed in any order, and they are not time-critical. Consequently, we plan to offload them to the GPGPU. Each task, whether enclave or skeleton, invokes a computational kernel. In the present hackathon, the kernel is the Finite Volume computation. When we invoke the kernel, we know whether we come from an enclave or a skeleton task. The vision thus is to use a simple if over the skeleton predicate to determine whether a task goes to the GPU or not.
Once we benchmark our code, it becomes apparent that we suffer from a poor task deployment with standard OpenMP.