The Chapel developer community is excited to announce the release of Chapel version 1.32! To obtain a copy, please refer to the Downloading Chapel page on the Chapel website.
Highlights of Chapel 1.32
Chapel 2.0 Release Candidate
The main highlight of Chapel 1.32 is that it is a release candidate for our forthcoming Chapel 2.0 release! If you’re not familiar with the concept of Chapel 2.0, it is intended to be a release that declares a core subset of the language and library features as ‘stable’. These features are ones that we intend to support in their current form going forward, such that code relying on them will not break across releases. Meanwhile, other features will be considered ‘unstable’, implying that they are ones where we are still learning from user experiences and refining interfaces before considering them to be stabilized. Unstable features may continue evolving after the 2.0 release, either by improving them until they too are stable, or replacing them with other, more stable features.
Chapel 1.32 being a 2.0 release candidate means that this is a key
time for Chapel users to give us feedback about aspects of our
design that they would like to see change prior to the 2.0 release.
Users may also want to compile their programs with the
--warn-unstable flag in order to identify any unstable features
that they are currently relying upon. Reliance on such features
could motivate you to advocate for stabilizing those features sooner,
or you could simply view it as an opportunity to be aware that those
features may continue to evolve over time. We are generally
interested in hearing about which unstable features user code is
currently relying upon, to help with our own prioritization efforts.
As part of the team’s push to make this a worthy Chapel 2.0 release candidate, Chapel 1.32 contains a large number of improvements to the language, compiler, and libraries. Some of these changes include:
new warnings to encourage a programming style in which generic types are more clearly visible in a program’s source code
a change in the default intent for arrays and record receivers (i.e., this) to const, for greater uniformity with other types
revised definitions of the compiler’s interpretation of const intents and default return/yield intents
significant improvements to ranges, domains, and distributions, including converting distribution types to records, obviating the need for the
major improvements to the
Time modules, including a new IO serialization framework for specifying how to read and write types to files orthogonally from the file’s format (see below for more detail)
Version 1.32 includes significant improvements to Chapel’s support for vendor-neutral GPU programming, both in terms of performance and capabilities.
Key performance improvements include:
compiler optimizations to reduce the number of pointer dereferences when accessing arrays within GPU kernels
switching the default memory allocation scheme for arrays to ‘array_on_device’ mode, in which an array’s data is stored directly on the GPU rather than in managed memory
a reduction in overheads when invoking math routines within GPU kernels by eliminating unnecessary boilerplate wrapper code
using per-task GPU streams, which can enable communication-computation overlap to improve performance
The non-trivial impact of these optimizations can be seen in the following graphs, which show the improvements that have occurred in a Chapel port of the SHOC Sort benchmark on both NVIDIA and AMD GPUs. Note that the second graph includes data transfer times while the first does not.
Chapel’s support for AMD effectively reaches feature parity with
NVIDIA in this release, largely due to the addition of a number of
math routines that had not been supported for AMD in
Chapel 1.31. In addition, the Chapel compiler’s
can now be used to inspect the assembly code generated when
targeting AMD GPUs.
Meanwhile, when targeting NVIDIA GPUs, Chapel 1.32 adds support for
generating multi-architecture binaries by setting
a comma-separated list of target architectures.
See the latest GPU Programming technical note for additional details about these changes and Chapel’s overall support for GPUs in 1.32.
Support for Co-Locales
Since its inception, Chapel has preferred to represent each compute node as a single top-level locale, using multitasking to implement any intra-node parallelism. This approach has been beneficial in many problem domains where running a process per core could result in larger memory requirements or poor surface-to-volume effects due to the amount of SPMD parallelism. (SPMD stands for Single Program, Multiple Data, a static and coarse-grained style of parallelism in which multiple copies of the same program are executed, e.g., one per processor core.)
However, as modern compute nodes have begun to support multiple NICs (Network Interface Chips, which permit processes to communicate with remote nodes), this traditional approach has faced challenges. Specifically, it is unduly complicated to have a single locale (UNIX process) leverage multiple NICs effectively; yet using just one NIC leaves potential performance benefits on the floor by not exercising the network to its full capacity.
To address this, Chapel 1.32 introduces user-facing support for co-locales, in which multiple locales can be mapped to a single compute node. Using co-locales can lead to performance improvements by making better use of the network and/or reducing the number of memory references that cross between sockets. For example, the following charts show improvements to a pair of benchmarks when run using two locales per node on a dual-NIC HPE Cray EX system using Slingshot 11:
Current support is limited to running a locale per socket on a given compute node, and is also limited to certain platforms and configurations:
HPE Cray EX platforms with Slingshot 11 when using
InfiniBand-based systems when using
To opt in to using co-locales, specify the number of locales for your Chapel program using a product of nodes and locales per node. For example, the following invocation:
$ ./myChapelProgram -nl 8x2
says to run the Chapel program on 8 nodes with 2 locales per node, for a total of 16 locales.
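To check how locales were actually placed, a small program can have every locale report where it is running. The following is a minimal sketch, not taken from the release itself; it assumes only the standard Locales array, numLocales, and the here.id and here.hostname queries.

```chapel
// Each locale reports its ID and the host it is running on.
// With co-locales enabled, several locales should report the same host.
coforall loc in Locales do
  on loc do
    writeln("locale ", here.id, " of ", numLocales, " on host ", here.hostname);
```

Run with -nl 8x2 on a supported platform, one would expect 16 lines of output in no particular order, with each host name appearing twice.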
For more information on using co-locales with Chapel, please refer to the online documentation.
IO Serialization Framework
The IO serialization framework that was prototyped in Chapel
is now used by default for calls like
it is also available for use with types written by end-users.
As an illustration, consider the following example that prints an array in a couple of different formats:
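The listing itself is not reproduced here. A minimal sketch consistent with the description, assuming the JSON module's jsonSerializer type and the withSerializer() method on file writers, might look like the following; the array contents and the names A and jsonWriter are illustrative only.

```chapel
use IO, JSON;                 // IO provides stdout; JSON provides jsonSerializer

var A = [1, 2, 3, 4];         // illustrative array of integers

writeln(A);                   // default format: elements separated by spaces

var jsonWriter = stdout.withSerializer(new jsonSerializer());  // stdout, but JSON-formatted
jsonWriter.writeln(A);        // same array, written as a JSON array
```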
The example first uses a normal writeln() to print the array of integers to the standard console (stdout) using Chapel’s traditional format: one element at a time, separated by spaces. Then, in line 7, we create a version of stdout that uses the JSON serializer for any write()s called on it. The result is that when we write the array to this output stream in line 8, it is printed using
standard JSON formatting. Other current serializers support
as alternate formats.
The new serialization framework also includes deserializers, which support reading values back in from the given format. And most importantly, users can now define their own methods specifying how their types should be written or read. This can be done in a format-neutral manner for simplicity, or in a way that’s sensitive to the output format when needed. For more information on defining these methods, please refer to their online documentation.
Improved ARM64 Support
Thanks to our colleagues on the
Qthreads team at Sandia National
Laboratories, support for ARM64 chips is significantly improved in
Chapel 1.32. Specifically, this release bundles version 1.19 of
Qthreads, in which task creation and switching have been
re-implemented using assembly code for ARM64 chips. This can
dramatically reduce multitasking overheads when using Chapel’s
As a simple illustration, the following table shows the impact of this fast task switching on a 16-node run of Bale Index Gather using various implementation strategies:
| Approach | Without fast tasks | With fast tasks | Improvement |
| --- | --- | --- | --- |
| ordered | 70.7 MB/s/node | 84.7 MB/s/node | 1.20x |
| ordered, oversubscribed | 86.3 MB/s/node | 140.4 MB/s/node | 1.63x |
| unordered | 147.5 MB/s/node | 152.3 MB/s/node | 1.03x |
| aggregated | 1352.0 MB/s/node | 1448.5 MB/s/node | 1.07x |
In addition, Qthreads 1.19 also improved portability for ARM64-based
platforms. This enables the use of
CHPL_TASKS=qthreads on a wider
variety of systems, such as M1/M2 Macs, where it is now the default.
And much more…
Beyond the highlights mentioned here, Chapel 1.32 contains numerous other improvements to Chapel’s features and interfaces, such as:
initial support for array allocations that will throw if the system is out of memory
a more robust set of types and routines for dealing with C pointer types, particularly with respect to
initial support for interface declarations, to opt in to special methods like the serialization methods mentioned above
features for power users to better understand the vectorization and transformation of their Chapel programs
support for selecting between processor types on chips with heterogeneous processing units
For a more complete list of changes in Chapel 1.32, please refer to its CHANGES.md file.
For More Information
For questions about any of the changes in this release, please reach out to the developer community on Discourse.
As always, we’re interested in feedback on how we can help make the Chapel language, libraries, implementation, and tools more useful to you in your work.
And, as always, thanks to everyone who contributed to the Chapel 1.32 release!