CHIUW 2015 Hot Topics Abstracts

Saturday, June 13, 2015, 4-5pm


The Chapel Memory Consistency Model
Sung-Eun Choi, Michael Ferguson, Elliot Ronaghan, Greg Titus (Cray Inc.)
Abstract: A memory consistency model is an important element of a parallel language because it allows programmers to reason about their programs and allows implementations to optimize them. In this talk, we will describe ongoing work on Chapel's memory consistency model. We will present the high-level design goals of the draft model, discuss how it avoids a pitfall or two found in other languages' models, and give a detailed description of one aspect that is unique to Chapel.
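
As a minimal illustration of what such a model governs (a sketch of ours, not taken from the talk), the program below publishes a value through an atomic flag. Because Chapel atomics are sequentially consistent by default, the consumer task is guaranteed to observe the producer's earlier write to data; reading data without the flag would be a data race with no defined outcome.

    // Illustrative sketch (not from the talk): publishing a value through an
    // atomic flag, relying on Chapel's memory consistency guarantees.
    proc main() {
      var data = 0;
      var flag: atomic bool;          // initially false

      cobegin with (ref data) {
        {                             // producer task
          data = 42;
          flag.write(true);           // publish only after 'data' is written
        }
        {                             // consumer task
          flag.waitFor(true);         // block until the flag is set
          writeln(data);              // prints 42; unsynchronized, this would be a race
        }
      }
    }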


Fast Fourier Transforms in Chapel
Doru Thom Popovici, Franz Franchetti (Carnegie Mellon University)
Abstract: The fast Fourier transform (FFT) is an important building block for a multitude of scientific applications from the High Performance Computing (HPC) community. Rather than implementing the FFT from scratch, application developers typically rely on pre-built and tuned FFT libraries such as the Intel MKL or FFTW. These libraries obtain performance by mapping efficient algorithms to the hardware features of the architectures they target. This presentation will give insight into how we can map efficient recursive mixed-radix FFT algorithms to the language features provided by Chapel to facilitate parallel and distributed computation. The ultimate goal is a Chapel implementation of the FFT that achieves performance competitive with existing tuned libraries. To reach this goal, we plan to uncover the optimizations required to bridge the performance gap between current Chapel code and library code, applying lessons learned from building the Spiral autotuning and program generation system.
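
As a rough sketch of how the recursive structure maps onto Chapel's features (our own illustration, limited to radix-2 rather than the mixed-radix algorithms the project targets; the proc name and the 0-based, power-of-two-sized input are assumptions of the sketch), the divide, recurse, and combine steps can each be written with Chapel's data-parallel loops:

    use Math;

    // Radix-2 Cooley-Tukey sketch (illustration only; the project targets
    // general mixed-radix algorithms). Assumes a 0-based array whose size
    // is a power of two.
    proc fft(x: [?D] complex): [D] complex {
      const n = D.size;
      if n == 1 then return x;

      const half = n / 2;
      var even, odd: [0..#half] complex;
      forall i in 0..#half {            // split into even/odd-indexed halves
        even[i] = x[2*i];
        odd[i]  = x[2*i + 1];
      }

      const E = fft(even),              // recurse on each half
            O = fft(odd);

      var y: [D] complex;
      forall k in 0..#half {            // combine with twiddle factors
        const theta = -2.0 * pi * k / n;
        const t = (cos(theta) + sin(theta)*1.0i) * O[k];
        y[k]        = E[k] + t;
        y[k + half] = E[k] - t;
      }
      return y;
    }

Bridging the gap to tuned libraries would then mean layering optimizations on top of such a skeleton, e.g. larger unrolled base cases, precomputed twiddle factors, and distributed arrays.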


A Preliminary Performance Comparison of Chapel to MPI and MPI/OpenMP
Laura Brown (US Army Engineer Research and Development Center)
Abstract: As the High Performance Computing community moves toward peta- and exascale computing, we need to begin evaluating alternatives to MPI, such as Chapel, for parallel computing in order to achieve optimal efficiency and scalability on large HPC systems. Any viable alternative, though, will need to be easy to use and provide performance comparable to (or better than) MPI's. As part of a larger study of parallel programming languages, I translated a small, non-trivial program into Chapel and evaluated its performance on a large production system. These results were then compared to the observed performance of MPI and MPI/OpenMP versions of the same program. This talk will discuss the outcome of this study, along with my initial impressions of Chapel as a usable parallel programming language.
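
As one illustration of the usability difference at stake (a fragment of ours, not the program from the study), a whole-array operation that would require explicit decomposition and an MPI_Allreduce in MPI/OpenMP is a single data-parallel reduction over a distributed array in Chapel:

    use BlockDist;

    // Illustrative fragment (not the study's program): a global dot product
    // over a block-distributed array. An MPI/OpenMP version would need an
    // explicit decomposition plus an MPI_Allreduce of the partial sums.
    config const n = 1000000;
    const D = {1..n} dmapped Block(boundingBox={1..n});
    var a: [D] real = 1.0;
    var b: [D] real = 2.0;

    const dot = + reduce (a * b);   // runs across all locales
    writeln("dot = ", dot);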


Data flow programming—a high performance and highly complicated programming concept?
Jens Breitbart (Technische Universität München)
Abstract: This talk gives a short introduction to GASPI, a low-level one-sided communication library that uses synchronization mechanisms similar to those available in Chapel. GASPI has been developed with a strict focus on performance, and applications using GASPI often beat tuned MPI applications. However, most users have been unable to exploit the power of the library due to the complexity that arises from its synchronization primitives. The talk will focus on the issues users faced when using GASPI and on how Chapel may provide a better end-user experience while still delivering high performance.
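
For comparison (a sketch of ours, not from the talk, and assuming the program is launched on at least two locales), the write-then-notify pattern that GASPI exposes through queues and notifications maps in Chapel onto an on-clause plus a sync variable, whose full/empty semantics also provide the ordering guarantee:

    // Sketch (not from the talk): a one-sided-style write-then-notify exchange.
    var buf: [0..#100] real;     // allocated on Locale 0
    var ready: sync bool;        // starts empty; acts as the notification

    begin on Locales[numLocales-1] {
      // executes on a remote locale but writes Locale 0's buffer directly
      forall i in buf.domain do
        buf[i] = i:real;
      ready.writeEF(true);       // notify: earlier writes are visible to the reader
    }

    const go = ready.readFE();   // block until notified
    writeln("buf[99] = ", buf[99]);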


If you can dodge a wrench, you can dodge a ball
Dylan Stark, George Stelle (Sandia National Laboratories)
Abstract: This talk will focus on the importance of low-level runtime configuration choices for achieving high performance, and how application and node architecture can necessitate different choices. The Chapel programming language significantly lowers the programmability barrier for writing parallel applications by providing clean semantics and abstractions for managing concurrency and data. Nevertheless, concurrent execution and dynamic management of on-node parallel resources is the responsibility of the underlying task layer. In the case of the current default, Sandia's Qthreads library, we show that mindful configuration for the node architecture and application is essential, and that making the wrong choices can be ruinous. That is to say, by avoiding performance pitfalls in the underlying thread layer (the wrench), we improve the likelihood of avoiding performance pitfalls in the higher-level language (the ball).
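
The knobs in question live outside the language, e.g. CHPL_TASKS selects the task layer while CHPL_RT_NUM_THREADS_PER_LOCALE and Qthreads' own shepherd/worker environment variables size it; still, a Chapel program can at least report the parallelism it ended up with. A small sketch of ours:

    // Sketch: report the on-node parallelism the runtime was configured with.
    // The configuration itself is external to the program (e.g. CHPL_TASKS=qthreads,
    // CHPL_RT_NUM_THREADS_PER_LOCALE, and Qthreads' shepherd/worker settings).
    writeln("locale ", here.id, " (", here.name, "): maxTaskPar = ", here.maxTaskPar);

    coforall tid in 0..#here.maxTaskPar do   // one task per available parallel slot
      writeln("task ", tid, " running");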


A Progress Report on COHX: Chapel on HSA + XTQ
Mauricio Breternitz, Bibek Ghimire, Mike Chu, Steve Reinhardt (Advanced Micro Devices)
Abstract: We report on our experience porting Chapel to the eXtended Task Queueing (XTQ) model, an extension to the Heterogeneous System Architecture (HSA). HSA enables user-level tasking via architecturally defined task enqueueing to CPU and GPU task queues. XTQ extends HSA by enabling cross-node task queueing via RDMA access to HSA queues on remote nodes. We describe our approach and experience in porting Chapel to utilize XTQ. This involves identifying and isolating the Chapel runtime components that must be updated to support this organization. We also describe initial experience with running Chapel-generated, XTQ-enabled binaries in two environments: an emulation layer, which provides the XTQ API and runs on an HSA-enabled, InfiniBand-connected cluster, and a gem5-based simulation model, which provides the XTQ API via a NIC device. The XTQ API is implemented and presented as an extension to the Portals 4 interface, underneath Chapel's GASNet layer. Initial microbenchmark results indicate potential speedup via low-latency intra- and inter-node task enqueueing.
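
The Chapel construct this work targets is remote task creation. A minimal sketch of ours of the pattern whose enqueue path XTQ is intended to shorten (not code from the project):

    // Minimal sketch of the pattern XTQ targets: asynchronously enqueueing
    // tasks onto other locales' task queues.
    sync {
      for loc in Locales do
        begin on loc do            // each 'begin on' becomes a remote task enqueue
          writeln("task running on locale ", here.id);
    }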