CHIUW 2023
The 10th Annual
Chapel Implementers and Users Workshop
Coding Day
Thursday June 1st, 2023
Participants Only
Workshop Day
Friday June 2nd, 2023
8:00am–3:00pm PDT (GMT–7)
free and online via Zoom
CHIUW 2023 is the 10th annual Chapel Implementers and Users Workshop, which serves as a forum where users and developers of the general-purpose Chapel programming language (chapel-lang.org) can meet to report on work being done with Chapel, exchange ideas, and forge new collaborations. Anyone interested in parallel programming and/or Chapel is encouraged to attend CHIUW, from long-term enthusiasts to those simply curious to learn more. This year's CHIUW will be online and there will be no registration fees.
Registration for CHIUW 2023 is free and can be completed at this link.
anytime | Chapel 101 [slides | video] | ||
Brad Chamberlain (Hewlett Packard Enterprise) | |||
This is a completely optional talk for those who are new to Chapel and looking for a crash-course, or for those who would simply appreciate a refresher. | |||
all day | Coding Day | ||
This day will consist of asynchronous sessions where Chapel developers help users and enthusiasts with their Chapel code. Coding Day submissions are closed. | |||
Time (PDT) | |||
8:00–8:35 | Welcome [slides | video]; State of the Project [slides | video] | ||
Engin Kayraklioglu, Brad Chamberlain (Hewlett Packard Enterprise) | |||
This session will serve as a welcome to and overview of CHIUW 2023, along with a brief summary of highlights and milestones achieved within the Chapel project since last year. | |||
Session chair: Daniel Fedorin (HPE) |
|||
8:35–8:55 | Coupling Chapel-Powered HPC Workflows for Python [submission | slides | video] | ||
John Byrne, Harumi Kuno, Chinmay Ghosh, Porno Shome, Amitha C, Sharad Singhal, Clarete Crasta, David Emberson, and Abhishek Dwaraki (Hewlett Packard Enterprise) | |||
Abstract: Decades ago, when data analytics was known as data mining, there was an adage – “No data, no mining!” The pendulum has swung to the opposite extreme, as everything from hospitals to cars now produces massive quantities of data. We address the challenge of how to lower the barrier for efficiently processing massive quantities of data. We describe a solution that enables ordinary Python programmers to share results while working efficiently with datasets that can be too large to process using a single commodity machine. Our solution leverages Chapel, Arkouda, and OpenFAM to hide complexity, transparently enabling programmers to process large amounts of data on clusters of compute nodes while making it easy for them to share and incrementally maintain derived datasets. | |||
8:55–9:15 | Towards a Scalable Load Balancing for Productivity-Aware Tree-Search [submission | slides | video] | ||
Guillaume Helbecque, Jan Gmys, Tiago Carneiro, Nouredine Melab and Pascal Bouvry (Université de Lille, University of Luxembourg) | |||
Abstract: In the context of exascale programming, we investigate a parallel distributed productivity-aware tree-search for exact optimization in Chapel. To this end, we present the DistBag-DFS distributed data structure, which is our revisited version of Chapel’s DistBag data structure for depth-first search. The latter implements a distributed multi-pool, as well as an underlying locality-aware load balancing mechanism. Extensive experiments on large unbalanced tree-based problems are performed, and the competitiveness of our approach is reported against MPI+X implementations. For our best results, we achieve 94% of the ideal speed-up, using up to 64 computer nodes (8192 cores). | |||
9:15–9:30 | Break | ||
Session chair: Dan Bonachea (Lawrence Berkeley National Laboratory) |
|||
9:30–9:45 | High-Performance Programming and Execution of a Coral Biodiversity Mapping Algorithm Using Chapel [submission | slides | video] | ||
Scott Bachman, Rebecca Green, Anna Bakker, Helen Fox, Sam Purkis and Ben Harshbarger (National Center for Atmospheric Research, The Coral Reef Alliance, University of Miami, Hewlett Packard Enterprise) | |||
Abstract: This paper will demonstrate how the parallelism and expressiveness of the Chapel programming language are used to achieve an enormous improvement in computational speed for a problem related to coral reef conservation. Chapel’s concise syntax and versatile data structures enable this problem to be solved in under 300 lines of code, while reducing the time to solution from days down to the order of seconds. This improvement is so substantial that it represents a paradigm shift in the way biodiversity can be measured at scale, providing a wealth of novel information for marine ecosystem managers and opening up brand new avenues for scientific inquiry. This paper will review the solution strategy and data structures in Chapel that allowed these improvements to be realized, and will preview future extensions of this work that have been made possible by this drastic speedup. | |||
9:45–10:00 | A Record-Based Pointer to Fabric Attached Memory [submission | slides | video] | ||
Amitha C, Clarete Crasta, Brad Chamberlain, Sharad Singhal, Porno Shome and Dave Emberson (Hewlett Packard Enterprise) | |||
Abstract: Fabric Attached Memory (FAM) enables fast access to large datasets required in High Performance Data Analytics (HPDA) and Exploratory Data Analytics (EDA) applications. The Chapel language is designed for such applications and helps programmers via high-level programming constructs that are easy to use, while delegating the task of managing data and compute partitioning across the cluster to the Chapel compiler and runtime. Our previous work integrates FAM access within Chapel using a language-provided feature called user-defined array distributions. To support more general computational patterns using FAM from Chapel through abstracted language constructs, we have enabled a record-based pointer type to the FAM-resident data object and enabled access to the FAM memory through these pointers. | |||
10:00–10:15 | Automatic Adaptive Prefetching for Fine-Grain Communication in Chapel [submission | slides | video] | ||
Thomas Rolinger and Alan Sussman (University of Maryland) | |||
Abstract: Applications that operate on large, sparse graphs and matrices exhibit fine-grain irregular memory access patterns, leading to both performance and productivity challenges on today's distributed-memory systems. The Partitioned Global Address Space (PGAS) model attempts to address these challenges by combining the memory of physically distributed nodes into a logical global address space, simplifying how programmers perform communication in their applications. Chapel is an example of a programming language that implements a PGAS. However, while Chapel and the PGAS model can provide high developer productivity, the performance issues that arise from irregular memory accesses are still present. In this talk, we will discuss an approach to improve the performance of Chapel programs that exhibit fine-grain remote accesses while maintaining the high productivity benefits of the PGAS model. To achieve this goal, we designed and implemented a compiler optimization that performs adaptive prefetching for remote data. Specifically, the compiler performs static analysis to identify irregular memory access patterns to distributed arrays in parallel loops and then applies code transformations to prefetch remote data that will be needed in future loop iterations. Our approach is adaptive because the prefetch distance (i.e., how many iterations ahead to prefetch) is automatically adjusted as the program executes to ensure the prefetches are not issued too early or too late. Furthermore, the optimization is fully automatic and requires no user intervention. We demonstrate runtime speed-ups as large as 3.2x via adaptive prefetching when compared to unoptimized baseline implementations of various irregular workloads across three different distributed-memory systems. | |||
10:15–10:30 | Break | ||
Session chair: Brad Chamberlain (HPE) |
|||
10:30–11:30 | PGAS Programming Models: My 20-year Perspective [slides | video] | ||
Paul Hargrove (Lawrence Berkeley National Laboratory) | |||
|
|||
Bio:
Dr. Paul Hargrove received his Ph.D. from Stanford University's
Program in Scientific Computing and Computational Mathematics in
2004. Prior to that Paul received a Bachelor of Arts degree from
Cornell University in 1994, completing a triple major in Physics
(magna cum laude), Math, and Computer Science.
Paul has been a PI at Lawrence Berkeley National Lab (LBNL) since September 2000, following periods of summer and part-time employment at LBNL. His current research focuses on network communications for HPC, with current software projects including UPC++ and Global Address Space Networking (GASNet-EX). Paul is PI of the Pagoda project, funded by the US Department of Energy's Exascale Computing Project (ECP), under which UPC++ and GASNet-EX are developed. |
|||
11:30–11:45 | Break | ||
Session chair: Josh Milthorpe (Oak Ridge National Laboratory) |
|||
11:45–12:00 | Too Big to Fail: Massive Scale Linear Algebra with Chapel and Arkouda [submission | slides | video] | ||
Christopher Hollis (U.S. Department of Defense) | |||
Abstract: This presentation details the development of a linear algebra extension for Arkouda (a NumPy-like Python application that utilizes Chapel for a backend server). This interface, dubbed AkSparse, allows for the creation and manipulation of sparse matrices at large scale with features designed to be familiar to users of SciPy’s existing sparse array package. This includes a sparse general matrix-matrix multiplication (SpGEMM) implemented with a novel algorithm leveraging the strengths of both Arkouda and Chapel. AkSparse allows users to integrate linear algebraic techniques into existing exploratory data analysis (EDA) workflows on datasets at a scale not previously possible. | |||
12:00–12:15 | Minimum-Mapping based Connected Components Algorithm [submission | slides | video] | ||
Zhihui Du, Oliver Alvarado Rodriguez, Fuhuan Li, Mohammad Dindoost and David A. Bader (New Jersey Institute of Technology) | |||
Abstract: Finding connected components is a fundamental problem in graph analysis. We develop a novel minimum-mapping based Contour algorithm to solve the connectivity problem. The Contour algorithm can identify all connected components of an undirected graph within O(log(dmax)) iterations on m parallel processors, where dmax is the largest diameter of all components in a given graph and m is the total number of edges of the given graph. Furthermore, each iteration can easily be parallelized by employing the highly efficient minimum-mapping operator on all edges. To improve performance, the Contour algorithm is further optimized through asynchronous updates and simplified atomic operations. Our algorithm has been integrated into an open-source framework, Arachne, that extends Arkouda for large-scale interactive graph analytics with a Python API powered by the high-productivity parallel language Chapel. Experimental results on real-world and synthetic graphs show that the proposed Contour algorithm needs fewer iterations and can achieve a 5.26-fold speedup on average compared with the state-of-the-art connected component method FastSV implemented in Chapel. All code is publicly available on GitHub (https://github.com/Bears-R-Us/arkouda-njit). | |||
12:15–12:30 | Removing Temporary Arrays in Arkouda [submission | slides | video] | ||
Ben McDonald (Hewlett Packard Enterprise) | |||
Abstract: This talk discusses experimental modifications made to the messaging layer of Arkouda (a NumPy-like Python package with a Chapel backend server) to pass several operations together as a block of Lisp code to be parsed on the server in one message, as opposed to the existing model of each command being passed as an individual message, requiring multiple passes to evaluate compound expressions. These modifications were made to eliminate the need for extra temporary array creation when executing compound operations in Arkouda. To improve the performance of the implementation, the initial code, which parsed the Lisp code once per task, was optimized to parse only once per message and remove dynamic allocations. The implementation is evaluated by comparing it against current Arkouda performance. The results of the comparison show that the Lisp interpreter is not yet outperforming standard Arkouda code, but additional functionality can be supported through this new feature. | |||
12:30–12:45 | Break | ||
Session chair: Ben McDonald (HPE) |
|||
12:45–12:55 | Development of a Knowledge-Sharing Parallel Computing Approach for Calibrating Distributed Watershed Hydrologic Models [submission | slides | video] | ||
Marjan Asgari (University of Guelph) | |||
Abstract: A research gap in calibrating distributed watershed hydrologic models lies in the development of calibration frameworks adaptable to the increasing complexity of hydrologic models. Parallel computing is a promising approach to address this gap. However, parallel calibration approaches should be fault-tolerant, portable, and easy to implement with minimum communication overhead for fast knowledge sharing between parallel nodes. Accordingly, we developed a knowledge-sharing parallel calibration approach using the Chapel programming language, with which we implemented the Parallel Dynamically Dimensioned Search (DDS) algorithm by adopting multiple perturbation factors and parallel dynamic searching strategies to keep a balance between exploration and exploitation of the search space. Our results showed that this approach achieved super-linear speedup and parallel efficiency above 75%. In addition, our approach has a low communication overhead, along with the positive impact of knowledge-sharing on the convergence behavior of the parallel DDS algorithm. | |||
12:55–1:05 | Parallel Implementation in Chapel for the Numerical Solution of the 3D Poisson Problem [submission | slides | video] | ||
Anna Jesus, Livia Freire, Willian Carlos Lesinhovski and Nelson Dias (University of São Paulo, Federal University of Paraná) | |||
Abstract: In this study, we present a parallel implementation of the numerical Poisson equation with domain decomposition in three directions using the Chapel programming language. Our goal is to study the potential of Chapel as an easy-to-implement alternative to a code originally developed in Fortran+MPI. The numerical experiments were performed on the cluster of the Instituto de Ciências Matemáticas e de Computação of the University of São Paulo, on a grid of 130³ points, corresponding to 2097152 unknowns. The results, for a single node only, suggest that the performance of Chapel tends to vary between 30-80% compared to the Fortran+MPI code with up to 32 threads. | |||
1:05–1:20 | Runtime Comparison Between Chapel and Fortran [submission | slides | video] | ||
Willian Lesinhovski, Nelson Dias, Livia Freire and Anna Jesus (Federal University of Paraná, University of São Paulo) | |||
Abstract: In this text we present a simple but interesting runtime comparison between Chapel and Fortran when performing some very common algorithms in numerical analysis: matrix multiplication, the Lax method for the kinematic wave equation, and the SOR method for the Poisson equation. Chapel performed very satisfactorily, reducing the processing time by 10% to 50% compared to Fortran. | |||
1:20–1:35 | Break | ||
Session chair: Harumi Kuno (HPE) |
|||
1:35–1:50 | Accelerating Data Analytics with Arkouda on GPUs [submission | slides | video] | ||
Josh Milthorpe, Brett Eiffert and Jeffrey Vetter (Oak Ridge National Laboratory) | |||
Abstract: In this talk, we will demonstrate how the Chapel GPU API can be used to accelerate Arkouda operations, which is most beneficial when a chain of operations is executed on the same data. We extend the GPU API to support shared virtual memory using CUDA unified memory and use this support to implement a custom domain map for Arkouda arrays. Our preliminary performance results show that GPU-accelerated operations in unified memory perform comparably to or better than explicit memory management while simplifying the programming task for complex Arkouda operations. | |||
1:50–2:05 | Enabling CHIP-SPV in Chapel GPUAPI module [submission | slides | video] | ||
Jisheng Zhao, Akihiro Hayashi, Brice Videau and Vivek Sarkar (Georgia Institute of Technology, Argonne National Laboratory) | |||
Abstract: This talk discusses enhancing support for Intel GPUs in the Chapel GPUAPI module. Essentially, we introduce the CHIP-SPV framework as a backend for the module, allowing the user to run their hand-written CUDA/HIP kernels on Intel GPUs as-is from their Chapel programs and allowing the runtime to perform finer-grain control of Intel GPUs through the Intel Level Zero runtime. In particular, we discuss the design and implementation of our CHIP-SPV backend in the GPUAPI module and demonstrate a preliminary performance evaluation of the backend on an Intel GPU platform. We also plan to discuss the possibility of using CHIP-SPV as a general code generation target in Chapel’s GPU code generator to enhance its portability. | |||
2:05–2:20 | Initial Experiences in Porting a GPU Graph Analysis Workload from CUDA/SYCL to Chapel [submission | slides | video] | ||
Paul Sathre, Atharva Gondhalekar and Wu-Chun Feng (Virginia Tech) | |||
Abstract: In this talk, we will discuss our initial experiences in porting a GPU graph analysis proxy workload from CUDA/SYCL to Chapel. This endeavor is part of a broader study to characterize the performance-productivity tradeoffs of Chapel's new native support for compiling loops for GPU execution. We will discuss the motivation for migrating to Chapel, provide an introduction to our proxy application known as edge-connected Jaccard similarity, and briefly discuss code migration issues and preliminary performance observations. | |||
2:20–2:40 | Recent GPU Programming Improvements in Chapel [submission | slides | video] | ||
Engin Kayraklioglu, Andy Stone and Daniel Fedorin (Hewlett Packard Enterprise) | |||
Abstract: Chapel’s emerging native GPU programming support has improved considerably in the last year. In this talk, we will highlight some of the improvements and discuss our next steps. | |||
2:40–?:?? | Open Discussion Session | ||
This final session is designed to support open discussion and
interaction among the CHIUW attendees, and to provide an
opportunity for lightning talks. This year's session
included:
|
|||
General Chair:
- Michelle Strout, HPE
Program Committee:
- Engin Kayraklioglu (chair), HPE
- Dave Wonnacott (co-chair), Haverford College
- Scott Bachman, National Center for Atmospheric Research/HPE
- Dan Bonachea, Lawrence Berkeley National Laboratory
- Maryam Mehri Dehnavi, University of Toronto
- Nelson Luís Dias, Federal University of Paraná
- Akihiro Hayashi, Georgia Tech
- Harumi Kuno, HPE
- Josh Milthorpe, Oak Ridge National Laboratory
- Thomas Rolinger, University of Maryland
- Rich Vuduc, Georgia Tech
- Andrew Younge, Sandia National Laboratories
- Brad Chamberlain, HPE
- Éric Laurendeau, Polytechnique Montreal
- Bill Reus, US DoD
- Didem Unat, Koc University
Call For Papers and Talks (for archival purposes)