New machines such as ARCHER2 allow users to solve larger and more complex problems than ever before. The machine reaches its full potential only if its considerable boost in parallel compute capacity is matched by advances on the software side, which will foreseeably rely on multifaceted (nested) parallelism. Each ARCHER2 node has 128 cores spread over 8 NUMA domains. One key to using the machine efficiently is therefore a shared-memory parallelisation that copes with NUMA effects and high core counts and that supports multiple layers of nested parallelism (for example, tasks within BSP within tasks within MPI). Without it, the limited strong scaling and detrimental NUMA effects already present on current single-socket and two-socket systems are amplified on ARCHER2, resulting in disappointing code performance. It is also safe to assume that the trend towards ever more cores and NUMA domains will carry over to the next generations of machines, since future increases in parallelism will primarily stem from more cores per node rather than from more nodes.
- Develop a strategy for pinning tasks in ExaHyPE to the right cores. We plan to roll out this strategy for OpenMP.
- Use more than one core per task. This feature is currently not supported within OpenMP. In the tradition of SYCL/oneTBB, we might thus have to rewrite our data-parallel compute kernels such that they spawn further tasks instead of relying on a sole parallel-for. Such nested parallelism makes the first goal challenging.
- Make the task scheduler NUMA aware. Guided by a cost model, the scheduler has to balance NUMA penalties against concurrency. It might, for example, be better to postpone a task or to leave cores idle if running the task immediately would thrash the caches due to NUMA effects.
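
The trade-off behind the third goal can be illustrated with a minimal sketch. It is not the ExaHyPE scheduler: the types `Task` and `CostModel`, the function `schedule`, and all parameter names are hypothetical. It also assumes that the 128 cores of a node are numbered contiguously, so that cores 0 to 15 form NUMA domain 0; a real implementation would query the actual topology, for instance through hwloc.

```cpp
#include <cassert>

// Node layout taken from the text: 128 cores in 8 NUMA domains.
// ASSUMPTION: cores are numbered contiguously per domain.
constexpr int CoresPerNode       = 128;
constexpr int NumaDomainsPerNode = 8;
constexpr int CoresPerDomain     = CoresPerNode / NumaDomainsPerNode;

// Map a core index to its NUMA domain under the assumption above.
constexpr int numaDomainOfCore(int core) { return core / CoresPerDomain; }

// Hypothetical task descriptor.
struct Task {
  int    homeDomain;    // NUMA domain that holds the task's data
  double localRuntime;  // estimated runtime when run in the home domain
};

// Hypothetical cost model parameters.
struct CostModel {
  double remotePenaltyFactor;       // runtime multiplier for remote data (>= 1)
  double expectedWaitForLocalCore;  // estimated time until a home-domain core is free
};

enum class Decision { RunNow, Postpone };

// Decide whether an idle core should pick up the task immediately or
// whether the task should wait for a core in its home domain.
Decision schedule(const Task& task, int idleCore, const CostModel& model) {
  assert(idleCore >= 0 && idleCore < CoresPerNode);

  // A core in the task's home domain incurs no NUMA penalty.
  if (numaDomainOfCore(idleCore) == task.homeDomain) {
    return Decision::RunNow;
  }

  // Compare the estimated completion time of both options.
  const double remoteNow  = task.localRuntime * model.remotePenaltyFactor;
  const double localLater = model.expectedWaitForLocalCore + task.localRuntime;
  return remoteNow <= localLater ? Decision::RunNow : Decision::Postpone;
}
```

With these hypothetical numbers, a task with a local runtime of 10 and a remote penalty factor of 2 is postponed if a local core is expected within 1 time unit, which leaves the remote core idle. How to obtain realistic penalties and waiting times is precisely the open question that the cost model has to answer.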
Yet to be written.
- Holger Schulz, Gonzalo Brito Gadeschi, Oleksandr Rudyy, Tobias Weinzierl: Task inefficiency patterns for a wave equation solver. IWOMP 2021