CHIUW 2021
The 8th Annual
Chapel Implementers and Users Workshop
Friday June 4th, 2021
8:00am–4:00pm PDT (GMT–7)
free and online via Zoom
Introduction: CHIUW 2021 is the 8th annual Chapel Implementers and Users Workshop. CHIUW serves as a forum where users and developers of the Chapel language (chapel-lang.org) can gather to report on work being done with Chapel, exchange ideas, and forge new collaborations. Anyone interested in parallel programming and Chapel is encouraged to attend CHIUW, from long-term enthusiasts to those simply curious to learn more.
Format: Due to Covid-19, CHIUW 2021 will be held online in a virtual workshop format. Talks will be given either live or via pre-recorded videos (linked below when available). Each talk will be followed by a short question-and-answer session. Short breaks between speakers and sessions will be used to deal with any technical challenges that arise from the distributed setting. Due to the wide range of time zones involved, there will not be any formal meal breaks, but you're encouraged to eat while watching the talks or during the breaks.
Registration: Registration for CHIUW 2021 is free and can be completed at this link.
Time (PDT) | Session |
anytime: | Chapel 101 [slides | video] |
Brad Chamberlain (Hewlett Packard Enterprise) | |
This is a completely optional talk for those who are new to Chapel and looking for a crash-course, or for those who would simply appreciate a refresher. | |
8:00–8:30: | Welcome [slides | video], State of the Project [slides | video w/ Q&A] |
Engin Kayraklioglu, Brad Chamberlain (Hewlett Packard Enterprise) | |
This session will serve as a welcome to and overview of CHIUW 2021, along with a brief summary of highlights and milestones achieved within the Chapel project since last year. | |
8:30–8:45: | Break: We'll use this break to make sure that the streaming technology is generally working for people before proceeding. |
Session chair: Simon Bourgault-Côté (Polytechnique Montreal) |
8:45–9:05: | Planned Improvements to the Chapel Compiler [submission | slides | video w/ Q&A] |
Michael Ferguson (Hewlett Packard Enterprise) | |
Abstract: The current architecture of the Chapel compiler makes it difficult to add new features such as separate compilation and IDE integration that are frequently requested by Chapel users. This talk will discuss plans for an improved compiler architecture that can better support these features. | |
9:05–9:25: | Recent InfiniBand Optimizations in Chapel [submission | slides | video w/ Q&A] |
Elliot Ronaghan (Hewlett Packard Enterprise) | |
Abstract: This talk will highlight recent optimizations that have significantly improved Chapel's performance and scalability on InfiniBand systems. Enhancements to the memory registration implementation have improved the performance of several core benchmarks and user applications including Arkouda, a Python package backed by Chapel that provides a key subset of the NumPy and Pandas interfaces. Performance results for core benchmarks will be shown on a small-scale InfiniBand cluster and Arkouda results will be shown on a 240 node InfiniBand-based HPE Apollo system. | |
9:25–9:45: | Locality-Based Optimizations in the Chapel Compiler [submission | slides | video w/ Q&A] |
Engin Kayraklioglu and Elliot Ronaghan (Hewlett Packard Enterprise) | |
Abstract: In recent releases, we have added two locality-based optimizations to the Chapel compiler. These optimizations enable the compiler to statically determine the locality of array accesses and to aggregate fine-grained copy operations. In this talk, we summarize how they are implemented, their impact on various programming idioms, associated performance improvements, and pertinent future directions. | |
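For readers curious about the kinds of loops involved, below is a minimal Chapel sketch (not the compiler's internals) of the two idioms these optimizations target, written in Chapel 1.24-era Block-distribution syntax; the array names and the reversing permutation are illustrative assumptions, not taken from the talk.

    use BlockDist;

    config const n = 1_000_000;

    // A Block-distributed domain and arrays, partitioned across the locales.
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var A, B: [D] real;
    var Idx: [D] int;

    // Idiom 1: the access index follows the loop's distributed iterand, so the
    // compiler can prove A[i] and B[i] are local and skip runtime locality checks.
    forall i in D do
      A[i] = 2.0 * B[i];

    // Idiom 2: indirect accesses through an index array cause fine-grained,
    // possibly remote writes; this is the pattern that copy aggregation targets.
    forall i in D do
      Idx[i] = n - i + 1;          // a simple reversing permutation
    forall i in D do
      A[Idx[i]] = B[i];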
9:45–10:00: | Break |
Session chair: Brad Chamberlain (HPE) |
10:00–11:00: | HPC Lessons from 30 Years of Practice in CFD Towards Aircraft Design and Analysis [slides | video w/ Q&A] |
Éric Laurendeau (Polytechnique Montreal) | |
Abstract: Aircraft design and analysis have historically been a technology pull, whether in the defense or the commercial sector, while Computational Fluid Dynamics has, since the 1960s, been a technology push by scientists. Taken together, they have been a formidable technology pull for High Performance Computing, which has itself been pushed by the incredible rise in computing capabilities.
The talk will present a crash course in aircraft design and, in particular, in aerodynamic analysis viewed through its various mathematical models. The discretization of these models into coupled systems presents its own challenges. The impact of HPC on industrial processes, including aircraft certification, will be discussed. Case studies will demonstrate the need for ever-increasing software and hardware development, as exemplified by NASA's current vision for 2030. To that end, contributions from Polytechnique Montreal will show the impact of the Chapel language on large-scale problems (billions of unknowns), both in its performance and in its ease of use. |
Bio: Éric Laurendeau is a Professor in the Department of Mechanical Engineering at École Polytechnique Montréal. A graduate of McGill University (Montréal, Canada), he obtained his D.E.A. at ISAE-SupAéro (Toulouse, France) and his Ph.D. from the University of Washington (Seattle, USA). He worked in (1996-2005) and then led (2005-2011) aerodynamic R&D activities within the Advanced Aerodynamics Department at Bombardier Aerospace, with applications to business (Challenger and Global lines), regional (CRJ), and commercial (C-Series, now Airbus A220) jets. His research interests are in Computational Fluid Dynamics (he is a past president of the CFD Society of Canada) and High-Performance Computing applied to the study of aerodynamic flows over aircraft configurations. He holds the Canada Research Chair in ‘Modeling and control of unsteady aircraft aerodynamics’ and the NSERC/CRIAQ/Bombardier Aerospace Industrial Research Chair in "Interdisciplinary Aerothermodynamic Analysis and Design Methods for Transport Aircraft". He sits on the Compute Canada Advisory Council on Research and the Calcul Québec Scientific Committee. He also serves on Aéro-Montréal's ‘innovation monitoring and strategy working group’, a think tank aimed at developing a strategy for aerospace innovation in Québec, which houses 75% of Canada's aerospace R&D. | |
11:00–11:15: | Break |
Session chair: Rich Vuduc (Georgia Tech) |
11:15–11:35: | Development of an Aircraft Aero-Icing Suite Using Chapel Programming Language [submission | slides | video | Q&A] |
Hélène Papillon Laroche, Simon Bourgault-Côté, Matthieu Parenteau and Éric Laurendeau (Polytechnique Montreal) | |
Abstract: This paper presents an aircraft ice accretion simulation suite implemented in the Chapel programming language for deterministic and stochastic ice accretion in two (2D) and three (3D) dimensions. The work is performed inside the CHApel Multi-Physics Simulation software (CHAMPS) developed at Polytechnique Montreal since 2019. Different physical models are added to the flow solver to simulate the droplet trajectories, the surface thermodynamic exchanges, and the surface deformation. The object-oriented approach used in the development of CHAMPS, combined with the generic functions and types from Chapel, allowed the development of a code that is easy to maintain and that still has high growth potential. The latest extension to CHAMPS is the capability to perform stochastic ice accretion using an advancing front grid methodology at the core and by randomly distributing the droplets, like in a cloud. Although stochastic ice accretion is not new, this paper presents an original methodology that has advantages over other methods from the literature, such as conserving a valid surface mesh from the beginning to the end of the stochastic accretion. Multi-layer ice accretion results are presented in 2D and 3D for a deterministic methodology, whereas single-layer 2D results are presented for the stochastic method. | |
11:35–11:50: | Towards Ultra-scale Optimization Using Chapel [submission | slides | video | Q&A] |
Tiago Carneiro (University of Luxembourg) and Nouredine Melab (INRIA Lille) | |
Abstract: Tree-based search algorithms applied to combinatorial optimization problems are highly irregular and time-consuming when it comes to solving big instances. Due to their highly parallel nature, algorithms of this class have been revisited for different parallel architectures over the years. These parallelization efforts have always been guided by the performance objective, setting productivity aside.
However, dealing with scalability implicitly raises the issue of heterogeneity: different programming models/languages, runtimes, and libraries need to be employed together to efficiently exploit all levels of parallelism of large-scale systems. As a consequence, efforts towards productivity are crucial for harnessing the future generation of supercomputers. In this talk, we present our efforts towards productivity-aware ultra-scale tree search using the Chapel language. Four topics are covered: the design and implementation of tree search using Chapel, improving intra-node efficiency, the use of GPUs, and future perspectives. |
11:50–12:10: | A Chapel Parallelisation of the Singular Value Decomposition [submission | slides | video | Q&A] |
Damian McGuckin (Pacific ESI), Peter Harding (PerformIQ) and Donald Carpenter (Pacific ESI) | |
Abstract: This paper discusses the parallelisation, using the Chapel programming language, of one of the most widely cited algorithms for the Singular Value Decomposition of a matrix. The mathematical alterations and programming constructs used to achieve that parallelisation are examined at length. Performance and validation tests showing the parallel speedup obtained when running on a multi-core computer architecture are documented, and the serial performance is compared against Fortran code implementing the original algorithm. Only parallelisation in a Symmetric Multi-Processing (SMP) environment is explored. | |
12:10–12:25: | Break |
Session chair: Nikhil Padmanabhan (Yale) |
12:25–12:45: | GPUAPI: Multi-level Chapel Runtime API for GPUs [submission | slides | video | Q&A] |
Akihiro Hayashi, Sri Raj Paul and Vivek Sarkar (Georgia Institute of Technology) | |
Abstract: Chapel is inherently well suited not only to homogeneous nodes but also to heterogeneous nodes, because it employs the concepts of locales, distributed domains, forall/reduce constructs, and implicit communication. However, there is still room for improvement in Chapel's GPU support.
This paper addresses some of the key limitations of past approaches to mapping Chapel onto GPUs. We introduce the GPUAPI module, which provides multi-level abstractions of existing low-level GPU APIs such as the CUDA runtime API. This module gives Chapel programmers the option of explicitly managing device memory (de)allocation and data transfers at the Chapel level while maintaining good performance and productivity. The GPUAPI module is particularly useful when programmers dive into lower-level details to incrementally evolve their GPU implementations for improved performance on multiple heterogeneous nodes. We provide two tiers of GPU API: the MID-LOW-level API and the MID-level API. The MID-LOW-level API offers thin wrappers for raw GPU API routines, whereas the MID-level API provides a more Chapel-programmer-friendly interface, e.g., allocating device memory using the 'new' keyword. The module also allows different levels of the API to coexist, even with the prototype GPU code generator in Chapel 1.24. Our preliminary performance and productivity evaluations show that the GPUAPI module significantly simplifies the use of GPU APIs in Chapel on multiple CPU+GPU nodes while achieving the same performance. |
12:45–1:05: | Arkouda Set Operation Optimizations [submission | slides | video | Q&A] |
Ben McDonald (Gonzaga University) and Elliot Ronaghan (Hewlett Packard Enterprise) | |
Abstract: This talk discusses a summer intern's experiences getting up and running on distributed-memory systems using Chapel. Over the course of the summer, the presenter learned how to use supercomputers thanks to Chapel's high-level syntax and its abstraction of complicated distributed-computing concepts. The talk also highlights significant contributions made to the performance of Arkouda, a NumPy-like Python package with a Chapel backend server that allows data scientists to use supercomputers interactively. The design of Arkouda will be outlined to put the work in context, and performance graphs of the improvements will be shown. | |
1:05–1:25: | Runtime Optimizations for Irregular Applications in Chapel [submission | slides | video w/ Q&A] |
Thomas Rolinger (University of Maryland), Christopher Krieger (Laboratory for Physical Sciences) and Alan Sussman (University of Maryland) | |
Abstract: Programming languages that implement the Partitioned Global Address Space (PGAS) model offer a simplified approach to writing parallel distributed applications, since explicit message passing is abstracted from the user. While the PGAS model offers high productivity, communication costs can still be a bottleneck for application performance. Applications that exhibit sparse and indirect memory accesses to distributed data pose a significant challenge to performance. These irregular applications lack spatial and temporal locality, leading to fine-grained remote communication that is not known until runtime. In this work, we investigate runtime optimizations for distributed irregular applications within the Chapel programming language. We focus on the inspector-executor technique, which evaluates a kernel of interest at runtime and constructs an optimized version of that kernel for execution. For our preliminary study, we hand-code the inspector and executor to demonstrate that runtime speed-ups as large as 224x, 13x and 96x are possible for Conjugate Gradient, a molecular dynamics simulation, and PageRank kernels, respectively. | |
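To make the targeted access pattern concrete, here is a small, self-contained Chapel sketch (not the authors' code) of an irregular sparse matrix-vector kernel over a Block-distributed vector, using Chapel 1.24-era Block-distribution syntax; the toy tridiagonal matrix and all names are assumptions for illustration only.

    use BlockDist;

    config const n = 16;
    const D = {1..n} dmapped Block(boundingBox={1..n});

    // A toy CSR matrix (tridiagonal, all ones), kept non-distributed for simplicity.
    var rowStart: [1..n+1] int;
    var colIdx: [1..3*n] int;
    var vals: [1..3*n] real;
    var nnz = 0;
    for i in 1..n {
      rowStart[i] = nnz + 1;
      for j in max(1, i-1)..min(n, i+1) {
        nnz += 1;
        colIdx[nnz] = j;
        vals[nnz] = 1.0;
      }
    }
    rowStart[n+1] = nnz + 1;

    var x, y: [D] real;
    x = 1.0;

    // The kernel of interest: x[colIdx[j]] is an indirect read whose target locale
    // is unknown until runtime, producing fine-grained communication. An inspector
    // records which remote elements each locale touches so that an optimized
    // executor can pre-stage them locally before re-running the kernel.
    forall i in D {
      var sum = 0.0;
      for j in rowStart[i]..rowStart[i+1]-1 do
        sum += vals[j] * x[colIdx[j]];
      y[i] = sum;
    }
    writeln(y);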
1:25–1:40: | Break |
Session chair: Lydia Duncan (HPE) |
1:40–2:00: | Exploratory Large Scale Graph Analytics in Arkouda [submission | slides | video w/ Q&A] |
Zhihui Du, Oliver Alvarado Rodriguez, David A. Bader (New Jersey Institute of Technology), Michael Merrill and William Reus (US DoD) | |
Abstract: Exploratory graph analytics helps maximize the informational value of a graph. However, increasing graph sizes make it impossible for existing popular exploratory data analysis tools, such as Python, to handle dozens-of-terabytes or larger data sets in the memory of a common laptop or personal computer. Arkouda is a framework in early development that brings together the productivity of Python on the user side with the high performance of Chapel on the server side. In this paper, we present preliminary work on overcoming the memory limit and the high performance computing coding roadblock that keep high-level Python users from analyzing very large graphs. A simple and succinct graph data structure is designed and implemented at both the Python front end and the Chapel back end of the Arkouda framework. A small memory footprint and O(1) time complexity for looking up neighbour vertices and adjacent edges are the major features of the proposed graph data structure. A typical graph algorithm, Breadth-First Search (BFS), is used to show how Chapel can be used to develop high-performance parallel graph algorithms productively. Two Chapel-based parallel BFS algorithms, a high-level version and an optimized version, have been implemented in Arkouda to support large-graph analysis. Multiple graph benchmarks are used to evaluate the performance of the provided graph algorithms. Experimental results show how performance can be optimized efficiently using suitable Chapel high-level data structures, parallel constructs, and simple algorithm descriptions. All of our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda/tree/graph-multilocales). | |
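For readers unfamiliar with the algorithm, the following is a minimal level-synchronous BFS sketch in shared-memory Chapel; it is not the Arkouda implementation, and the toy ring graph, CSR-like layout, and variable names are illustrative assumptions.

    config const nv = 8;

    // A toy ring graph in a CSR-like adjacency layout: vertex v is connected to
    // its two neighbours on the ring.
    var nbrStart: [0..nv] int;
    var nbrs: [0..2*nv-1] int;
    for v in 0..nv-1 {
      nbrStart[v] = 2*v;
      nbrs[2*v]   = (v + 1) % nv;
      nbrs[2*v+1] = (v - 1 + nv) % nv;
    }
    nbrStart[nv] = 2*nv;

    // Atomic depths keep the concurrent frontier expansion race-free.
    var depth: [0..nv-1] atomic int;
    forall d in depth do d.write(-1);

    var frontier: domain(int);   // parallel-safe associative domain of vertices
    frontier += 0;               // start the search from vertex 0
    depth[0].write(0);

    var level = 0;
    while frontier.size > 0 {
      var next: domain(int);
      forall v in frontier with (ref next) {
        for u in nbrs[nbrStart[v]..nbrStart[v+1]-1] {
          // Claim each unvisited neighbour exactly once via compare-and-swap.
          if depth[u].compareAndSwap(-1, level+1) then
            next += u;
        }
      }
      frontier = next;
      level += 1;
    }
    for v in 0..nv-1 do
      writeln("vertex ", v, " has depth ", depth[v].read());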
2:00–2:15: | Toward a Multi-GPU Implementation of a GMRES Solver in CHAMPS [submission | slides | video | Q&A] |
Anthony Bouchard, Matthieu Parenteau and Éric Laurendeau (Polytechnique Montreal) | |
Abstract: The Computational Fluid Dynamics (CFD) community has successfully leveraged GPUs for its solvers. In industry, low-order solvers are often used because only engineering levels of accuracy are needed. Unlike high-order methods, these solvers do not have high ratios of floating-point operations per memory fetch, but they can still make good use of GPUs because of the large number of elements computed and the higher memory bandwidth of this type of hardware. Such codes often rely on linear solvers designed to be optimal for CPUs, with sequential components such as the Symmetric Gauss-Seidel (SGS) solver. In an attempt to adapt to the hardware architecture and to better utilize the computational power of the GPU, a Jacobian-free Newton-Krylov (JFNK) type of solver is envisioned. The JFNK solver exploits the fact that only the effect of the Jacobian on a vector is needed, removing the need to store and invert the Jacobian matrix; instead, a finite-difference approximation is computed. This paper discusses the early implementation of such a solver by showing the GPU performance of a GMRES solver (with Jacobian) developed in CHAMPS, a 3D unstructured RANS solver written in Chapel. The performance is evaluated by presenting speedups and a strong-scaling analysis of the method. | |
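To make the central idea concrete, here is a tiny Chapel sketch (not CHAMPS code) of the Jacobian-free matrix-vector product that JFNK relies on, J*v ~= (R(u + eps*v) - R(u)) / eps, with a toy residual function standing in for the flow solver's; all names and the residual itself are assumptions for illustration.

    config const n = 8;
    const D = {1..n};

    // A toy nonlinear residual standing in for the flow solver's R(u).
    proc R(const ref u: [D] real) {
      var r: [D] real;
      forall i in D do
        r[i] = u[i]*u[i] - 1.0;
      return r;
    }

    // The Jacobian-free product: one extra residual evaluation approximates J*v,
    // so the Jacobian matrix is never stored or inverted.
    proc jacVec(const ref u: [D] real, const ref v: [D] real, eps = 1.0e-7) {
      var up: [D] real = u + eps * v;
      var Jv: [D] real = (R(up) - R(u)) / eps;
      return Jv;
    }

    var u: [D] real = 2.0;
    var v: [D] real = 1.0;
    writeln(jacVec(u, v));   // each entry should be close to 2*u[i] = 4.0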
2:15–2:30: | HPC Workflow Management with Chapel [submission | slides | video w/ Q&A] |
Benjamin Albrecht (Hewlett Packard Enterprise) | |
Abstract: Coordinating many runs of monolithic high performance computing (HPC) applications is a challenge faced by much of the HPC user community, including domains such as data science, bioinformatics, astronomy, and computational chemistry. For simple cases, users tend to rely on shell scripts that interact with the system workload manager to launch their applications. However, more advanced workflows can require complexity beyond what can reasonably be accomplished in a shell script, and a more productive programming language is needed to tackle these tasks. Cray HPO is a black-box hyperparameter optimization framework written in Chapel. The framework employs many advanced workflow features, such as parallel launching, time budgets, and variable node counts. This talk will explore some HPC workflow design patterns encountered in the development of Cray HPO and demonstrate why Chapel works well in this area. | |
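To give a flavor of what such coordination can look like, here is a small hedged Chapel sketch (not Cray HPO code) that launches a handful of command-line runs in parallel and waits for each of them. It assumes the Spawn module shipped with Chapel 1.24 (renamed Subprocess in later releases); the parameter values and the echo command are placeholders for a real application launch.

    use Spawn;

    // Placeholder parameter sweep: each entry stands in for one application run.
    const trials = [0.1, 0.2, 0.4, 0.8];

    coforall (t, i) in zip(trials, 1..) {
      // One task per run; a real workflow would build an srun/mpirun/qsub command
      // line here instead of this placeholder echo.
      var sub = spawnshell("echo running trial " + i:string + " with parameter " + t:string);
      sub.wait();   // block this task until its run completes
    }
    writeln("all trials finished");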
2:30–2:45: | Break |
Session chair: Michelle Strout (HPE) |
2:45–?:??: | Open Discussion Session |
This final session is designed to support open discussion and interaction among the CHIUW attendees, and will include some lightning talks. If you would like to give a quick update about your ongoing Chapel-related work, please let us know. | |
General Chair:
- Michelle Strout, HPE
Program Committee:
- Engin Kayraklioglu (chair), HPE
- Rich Vuduc (co-chair), Georgia Tech
- Maryam Dehnavi, University of Toronto
- Clemens Grelck, University of Amsterdam
- Paul H. Hargrove, Lawrence Berkeley National Laboratory
- Josh Milthorpe, Australian National University
- Cathie Olschanowsky, Boise State University
- Mark Raugas, Pacific Northwest National Laboratory
- Tyler Simon, UMBC
- Christian Terboven, RWTH Aachen University
- Didem Unat, Koc University
- Jeff Vetter, Oak Ridge National Laboratory
Steering Committee:
- Brad Chamberlain, HPE
- Mike Merrill, U.S. DOD
- Nikhil Padmanabhan, Yale University
Call For Papers and Talks (for archival purposes)