The power of Lo(o)sing Control – When does a re-implementation of mature simulation fragments with HPC DSLs pay off?

This ISC workshop has been featured by HPC Wire.

Conference: ISC High Performance
Date: 28 June 2018, 2pm-6pm
Organisers: Tobias Weinzierl, Michael Bader
Venue: Frankfurt Marriott Hotel, room Lux (5th floor)

Exascale roadmaps tend to postulate a need for revolutionary software rewrites. Otherwise, so they claim, software will not be able to scale to the next level. One of these revolutions is to translate applications into tasks with dependencies; the task language here can be read as a DSL. Another one is the replacement of intrinsics, SIMD pragmas, stencil codes, and so forth with purpose-built DSLs. Such rewrites, however, are expensive in terms of development time.

We invite speakers developing exascale codes to summarise their experiences and expectations w.r.t. novel programming techniques. The project-driven talks are complemented by provocations from task-middleware and DSL developers. Yet, we do not aim for a comprehensive overview of techniques. Instead, the workshop aims to sketch answers to the following questions:

  • Where is the pain barrier at which consortia are willing to rewrite major parts of their codes? Are these techniques well-suited for libraries, core components, or whole applications?
  • To which degree are mature simulation codes willing to give up control of SIMDization, task scheduling, data placement, and so forth?
  • What metrics or characteristics should new paradigms expect their user codes to hand over to the DSL/runtime (arithmetic intensity, affinity, dependencies)?

Speakers:

Michael Bader (Technische Universität München): Predictive Load Balancing vs. Reactive Work Stealing – Parallel Adaptive Mesh Refinement and the Chameleon Project

One of the predictions on exascale hardware claims that equal work load will no longer lead to equal execution time. Hence, applications will need to react adaptively to load imbalances at runtime despite best efforts in predictive load distribution. The Chameleon project (www.chameleon-hpc.org) aims to provide such a reactive programming environment through small augmentations to the standard MPI and OpenMP programming models. For applications, a crucial question will be how much of this reactive behaviour has to be triggered by the application itself, and to what extent it can be supported by the programming environment – constrained, of course, by how much extra programming effort and how many intrusive changes to the existing application are acceptable. In my talk I will present the interplay of predictive load balancing and reactive work stealing in the parallel AMR code sam(oa)², and how these approaches shall be supported by the intended Chameleon environment.
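
The contrast between predictive partitioning and reactive work stealing can be illustrated with a generic self-scheduling sketch in plain C++: instead of receiving a fixed partition up front, idle workers pull the next task from a shared pool. The function name `process_tasks` is hypothetical; this is a minimal sketch of the general idea, not the Chameleon API.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Reactive self-scheduling: workers that finish early simply fetch more
// work from a shared atomic counter, so runtime load imbalance is absorbed
// without any predictive cost model.
std::vector<double> process_tasks(int num_tasks, int num_workers) {
    std::vector<double> result(num_tasks, 0.0);
    std::atomic<int> next{0};
    auto worker = [&]() {
        // Each iteration claims one task; task costs may vary at runtime.
        for (int i = next.fetch_add(1); i < num_tasks; i = next.fetch_add(1))
            result[i] = 2.0 * i;  // placeholder for real per-task work
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return result;
}
```

A predictive scheme would instead assign task ranges to workers before execution; Chameleon's contribution is to combine both within standard MPI and OpenMP.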

Richard Bower (Durham University): SWIFT – the benefits of re-implementing Gadget, the astrophysics benchmark

Nathan D. Ellingwood (Sandia): Lessons Learned: Experiences from Introducing the Kokkos Programming Model into Legacy Applications

For quite some time the computer science community has been working on new high-level parallel programming models, one of their main sales pitches being to make ‘Next Generation Platforms’ (NGP) accessible to a wide range of applications and their developers. With machines such as Sierra and Summit slated to come online this summer, prototypes of upcoming exascale platforms are now here, and views on adopting a parallel programming model have changed from ‘Why would I want to deal with this?’ to ‘I need a viable solution now!’ to enable legacy codes and large applications to utilize the new platforms. The Kokkos team offers one such solution: a parallel programming ecosystem in C++ which supports the major publicly available HPC platforms and promises to isolate application teams from future architecture changes.

In this talk, a short overview of the Kokkos EcoSystem (consisting of KokkosCore, KokkosKernels, KokkosTools and KokkosSupport) will be provided, along with how it addresses the requirements of legacy applications transitioning to exascale architectures. The presentation will also discuss ‘lessons learned’ from the ongoing porting efforts of the numerous applications adopting Kokkos, with a focus on issues that were not encountered in proxy app experiments.
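
The programming pattern that isolates an application from the architecture can be sketched as follows. In Kokkos itself this is `Kokkos::parallel_for` with a lambda body, dispatched to an OpenMP, CUDA, or other backend; the serial `parallel_for` template below is a hypothetical stand-in used purely for illustration, not the real Kokkos API.

```cpp
#include <cstddef>
#include <vector>

// The application expresses only the index range and the loop body;
// the library owns the decision of how (and where) to execute it.
// A real backend would dispatch to threads or a GPU instead of a
// sequential loop.
template <class Functor>
void parallel_for(std::size_t n, Functor f) {
    for (std::size_t i = 0; i < n; ++i) f(i);
}

// Example user kernel: y += a * x, written against the abstraction only.
std::vector<double> axpy(double a, const std::vector<double>& x,
                         std::vector<double> y) {
    parallel_for(x.size(), [&](std::size_t i) { y[i] += a * x[i]; });
    return y;
}
```

Because the kernel body never names a backend, swapping the execution target becomes a library concern rather than an application rewrite – the central promise of this class of programming models.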

Thomas Fahringer (University of Innsbruck): t.b.c.

Contemporary parallel programming approaches often rely on well-established parallel libraries and/or language extensions to address specific hardware resources, which can lead to mixed parallel programming paradigms. In contrast to these approaches, AllScale proposes a C++ template-based approach to ease the development of scalable and efficient general-purpose parallel applications. Applications utilize a pool of parallel primitives and data structures for building solutions to their domain-specific problems. These parallel primitives are designed by HPC experts who provision high-level, generic operators and data structures for common use cases. The supported set of constructs may range from ordinary parallel loops, through stencil and distributed graph operations as well as frequently utilized data structures including (adaptive) multidimensional grids, trees, and irregular meshes, to combinations of data structures and operations such as entire linear algebra libraries. This set of parallel primitives is implemented in pure C++ and may be freely extended by third-party developers, similar to conventional libraries in C++ development projects. One of the peculiarities of AllScale is that its main source of parallelism is nested recursive task parallelism. Sophisticated compiler analysis determines the data needed for every task, which is of paramount importance to achieve performance across a variety of parallel architectures.

Harald Köstler (Universität Erlangen-Nürnberg): Code generation approaches for HPC

In recent years, various approaches for increasing the productivity and portability of HPC codes have been explored. Usually either external or embedded domain-specific languages are developed, where the performance of the resulting implementations is of course very important for many applications from computational science and engineering. In my talk I will give a short overview of some prominent existing code generation frameworks and discuss their advantages and disadvantages. Then, I will present our code generation framework ExaStencils, written in Scala, in more detail. It allows whole-program generation with several back-ends such as C++ and CUDA for a restricted class of applications that can be described by partial differential equations on structured grids. Furthermore, applications from fluid dynamics and geosciences will be shown, including their performance on CPU and GPU clusters.
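
Whole-program generation can be shown in miniature: from a few problem parameters, emit the source of a specialized kernel with all sizes and strides baked in as constants, which the back-end compiler can then optimize aggressively. Real frameworks such as ExaStencils derive this from a DSL description of the PDE; the toy generator and its name `generate_jacobi_kernel` below are purely illustrative assumptions.

```cpp
#include <sstream>
#include <string>

// Emit C-style source for a 2D Jacobi stencil sweep on an nx-by-ny grid.
// Grid dimensions become compile-time constants in the generated code,
// which is the key productivity/performance trade of code generation.
std::string generate_jacobi_kernel(int nx, int ny) {
    std::ostringstream out;
    out << "void jacobi(const double* u, double* v) {\n"
        << "  for (int j = 1; j < " << ny - 1 << "; ++j)\n"
        << "    for (int i = 1; i < " << nx - 1 << "; ++i)\n"
        << "      v[j*" << nx << "+i] = 0.25 * (u[j*" << nx << "+i-1]"
        << " + u[j*" << nx << "+i+1]\n"
        << "          + u[(j-1)*" << nx << "+i] + u[(j+1)*" << nx << "+i]);\n"
        << "}\n";
    return out.str();
}
```

A production framework would additionally generate boundary handling, parallelization (e.g. CUDA launches), and data layout from the same high-level specification.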

Martin Kronbichler (Technische Universität München): High-performance finite element computations with the deal.II finite element library

I will present HPC challenges from the perspective of high-order finite element computations within the generic deal.II finite element library. Besides a wide range of finite element functionality, the deal.II library comes with a flexible matrix-free operator evaluation infrastructure based on fast quadrature with sum factorization techniques, whose design has been guided by high performance computing principles. The two central ingredients are efficient arithmetic via explicit SIMD instructions and memory-lean data structures. Our work has shown that these concepts clearly outperform sparse matrix kernels for applications using quadratic or higher-degree shape functions, making performance close to the underlying hardware limits possible for both continuous and discontinuous elements. Despite its efficiency, the deal.II implementation remains accessible to the programmer with the full flexibility of the C++ programming language.
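
The sum factorization idea mentioned above can be made concrete: a 2D tensor-product operator (A ⊗ A) is applied as two sweeps of the 1D matrix A, one per coordinate direction, reducing the per-element cost from O(m⁴) for the assembled matrix to O(m³). The following is a generic, unoptimized illustration of the technique, not deal.II code.

```cpp
#include <vector>

// Apply (A ⊗ A) to values u on an m-by-m tensor-product node grid.
// A is an m-by-m matrix (row-major), e.g. a 1D interpolation from
// nodes to quadrature points. Two 1D sweeps replace one dense 2D apply.
std::vector<double> apply_tensor_product(const std::vector<double>& A,
                                         const std::vector<double>& u,
                                         int m) {
    std::vector<double> tmp(m * m, 0.0), out(m * m, 0.0);
    // Sweep in x: tmp(j,i) = sum_k A(i,k) * u(j,k)
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < m; ++i)
            for (int k = 0; k < m; ++k)
                tmp[j * m + i] += A[i * m + k] * u[j * m + k];
    // Sweep in y: out(j,i) = sum_k A(j,k) * tmp(k,i)
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < m; ++i)
            for (int k = 0; k < m; ++k)
                out[j * m + i] += A[j * m + k] * tmp[k * m + i];
    return out;
}
```

In a matrix-free FEM kernel, the innermost loops are additionally vectorized with explicit SIMD instructions across elements or lanes, which is the second ingredient named above.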

There are several challenges that we constantly need to address from a library perspective, such as the need to cover a wide range of differential operators, polynomial degrees, and mesh configurations, which prevents us from adopting problem-specific solutions. Furthermore, the high-performance operator evaluation must be integrated into a solver stack including explicit time integrators or multigrid linear and nonlinear solvers. This integration is particularly intricate because, in the typical memory-bandwidth-constrained setting, performance can often only be gained by merging operations across the boundaries of mathematical operations, such as merging vector updates and inner products with operator evaluation. This goes somewhat against what many scientific software developers consider best practice today, namely the concept of splitting algorithms into different components or even different libraries to reduce the amount of code that must be maintained.
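
The fusion trade-off described above can be sketched in a few lines: instead of one loop for an operator application y = D x and a second loop for the inner product (x, y), a fused loop touches each vector entry once, roughly halving memory traffic in the bandwidth-bound regime. The diagonal operator and the name `apply_and_dot` are illustrative assumptions, not deal.II interfaces.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Fused kernel: operator application (here a diagonal scaling, as a
// stand-in for a matrix-free operator) merged with the inner product
// (x, y) in a single pass over the data.
std::pair<std::vector<double>, double>
apply_and_dot(const std::vector<double>& diag, const std::vector<double>& x) {
    std::vector<double> y(x.size());
    double dot = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {  // single sweep over memory
        y[i] = diag[i] * x[i];
        dot += x[i] * y[i];
    }
    return {y, dot};
}
```

Keeping the operator, the vector update, and the reduction in separate components is cleaner software engineering, but each extra pass re-reads the vectors from memory, which is exactly the cost the fused variant avoids.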