Optimizing Performance of Chapel Programs¶
One of the benefits of the Chapel programming language is that it makes it easy to get started writing parallel and distributed programs. Features such as distributed arrays, implicit communication, and a global view of data make it easier to go from a serial program, to a parallel program, to a distributed-memory parallel program. However, these same features can also create performance problems, and it takes additional work to go from a distributed-memory parallel program to one that performs well. This document provides a guide along the way: it highlights the issues to be aware of when optimizing the performance of a Chapel program and discusses how those performance issues can be resolved.
Note
This document is focused on how Chapel users can improve the performance of their applications. There are many potential areas for improvement in the compiler, runtime, and standard library that can also improve the performance of Chapel programs. In fact, some optimizations to the Chapel programming system have made benchmarks run many times faster without needing any change to the application source code at all! These are important techniques as well, but they are out of scope for this document.
Note
This document is a draft and a work in progress. Updates to it based on experience with optimization are welcome!
A Few Words About Process¶
Before we dive into optimization techniques, let’s briefly discuss some good programming practices when working with Chapel. Optimization efforts will involve changing the code, and optimizations that cause the program to stop working correctly aren’t any use at all. It needs to be easy to check that everything is working.
Revision Control¶
It’s important to use revision control tools (e.g. git). Revision control helps with optimization specifically:
- You can go back to an earlier version if your changes break the program or make it slower
- You can record your progress towards improving performance
Testing¶
It’s important to have tests that are easy to run in order to check that your program is working correctly.
We have used the following steps when testing a new change to a parallel and distributed Chapel application:
1. Compile with checks enabled (i.e. without --fast)
2. Test with 1 locale and --dataParTasksPerLocale=1 (dataParTasksPerLocale is described in Configuration Constants for Default Data Parallelism)
3. Gradually increase --dataParTasksPerLocale until you are satisfied it is working. At a minimum, try leaving this argument off to run with number of tasks == number of cores.
4. Test while simulating multiple locales on your development system (see the next section)
5. Gradually increase the number of locales and the number of tasks until you are satisfied it is working.
You can compile with --fast
and measure performance before or after
this process. If you are targeting distributed memory, it’s ideal to test
performance on a multi-node system. Occasionally, it’s useful to simulate
multiple locales on a laptop (described below). Be aware that the
performance of such a simulation won’t necessarily match a real
multi-node system.
Simulating Multiple Locales on a Laptop¶
It’s possible to run multiple processes on your laptop or workstation that simulate the communication that happens on a big multi-node system. This is also called running locally oversubscribed.
In this setting, the program will generally run much slower than it would
when running in a multi-node system. As a result, you might need to run
with a smaller problem size to test in this way. Nonetheless, it is
possible to identify and optimize communication. Communication occurs in
Chapel programs when using on
statements and also when reading or
writing data (such as array elements) that are stored on a remote locale.
In communication-bound applications, optimizing the communication can make the locally oversubscribed program run faster. Another important technique when working locally oversubscribed is to reduce the communication counts (communication counts are discussed below). The communication counts will match (or come close to matching) between running locally oversubscribed and running on a multi-node system with the same number of locales. Significant reductions in communication counts will generally lead to improvements in the scalability of your application on multiple nodes.
To run locally oversubscribed, build Chapel with GASNet with the UDP conduit. Note that GASNet with the UDP conduit is not a high-performing configuration; in this setting that is OK because the communication overhead of UDP will approximate the communication overhead on a real multi-node system. You can use these commands to build with the UDP conduit:
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=udp
make
Then, run with these settings:
export CHPL_RT_OVERSUBSCRIBED=yes
export GASNET_SPAWNFN=L
export CHPL_RT_MASTERIP=127.0.0.1
You might also wish to run with a lower number of threads per simulated
locale in order to reduce interference between the processes. For
example, if you are working on a system with 32 cores, you might run
with CHPL_RT_NUM_THREADS_PER_LOCALE=8 ./application -nl 4. That uses 8 threads per locale, with 4 locales, so a total of 32 threads.
See also Controlling the Number of Threads, Oversubscription, and Using the Portable UDP Conduit.
Measuring Performance¶
- Reproducible Performance Tests
It’s important to have performance tests that are easy to run that you can use to measure your program’s performance. These performance test cases need to be easy to repeat. Usually, that means setting aside some problem configuration and input data to use as the main performance test case. In some cases, you will need a battery of performance tests instead of a single test case.
- Timing Regions
A common practice is to use stopwatch to time different regions of your program. This coarse-grained timing information indicates where your program is spending the most time. The longest-running regions are the places where optimization will give the most improvement to the program’s performance. You can print out the performance of each region after it is timed. It’s a good idea to use a config const to make the timing only print when you are studying performance (see the sketch after this list).
- Counting Communication Events
If you need to do most of your work by running locally oversubscribed (see Simulating Multiple Locales on a Laptop) then it’s particularly useful to measure performance by counting the number of communication events. Counting communication events can also be useful as a secondary performance metric for multi-node runs. Reducing the number of communication events is a useful strategy for optimizing communication.
You can use the CommDiagnostics module and startCommDiagnostics() followed by getCommDiagnostics() to count communication events. It’s common to print out the communication counts with writeln(getCommDiagnostics()). When measuring communication counts, it’s important to be aware of these best practices:
- Compile with --fast to avoid communication events related to error checking
- Communication count information will be easier to understand if you compile with --no-cache-remote because the cache for remote data, which is enabled by default, will cause the counts to include more categories.
- Counting communication events can slow down execution time significantly. As a result, it’s a good practice to control communication diagnostics with a config const and to measure timing only when communication counting is off.
Communication counts provide a way to compare communication performance independent of where you are running. For example, you might measure and seek to reduce the communication counts when running oversubscribed on a laptop or workstation. Significant reductions in communication counts from such an effort should also help with multi-node performance when you are able to run on a big system.
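Here is a minimal sketch combining these practices, assuming a placeholder kernel and an invented problem size; the config const names (printTiming, countComms) are made up for illustration:

use Time, CommDiagnostics;

config const n = 1000000;         // problem size for the performance test
config const printTiming = false; // enable when studying performance
config const countComms = false;  // enable to count communication events
                                  // (avoid timing while this is on)

proc kernel() {
  // placeholder for the computation being measured
  var A: [1..n] real;
  forall i in 1..n do
    A[i] = i:real;
}

proc main() {
  var t: stopwatch;

  if countComms then startCommDiagnostics();
  t.start();

  kernel();

  t.stop();
  if countComms {
    stopCommDiagnostics();
    writeln(getCommDiagnostics());
  }
  if printTiming then
    writeln("kernel: ", t.elapsed(), " seconds");
}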
Overall Flow¶
The general flow of performance optimization will be:
1. Use techniques like timing the different phases of your computation to try to understand which parts are slow or are not getting faster as you add more locales or cores
2. Come up with a theory as to what could be improved by thinking about what is happening when your program is running and/or by looking at timing or communication count information
3. Make an adjustment (at this step, the adjustment only needs to be good enough to measure its impact on performance)
4. Measure its impact on performance
5. If it is good, clean up the adjustment, perform correctness testing, and commit/PR it to your program’s revision control; if not, try something else
6. Repeat
Configuration Matters¶
The first thing to check is that you are using Chapel in its highest-performing configuration for your system.
- use --fast when compiling Chapel programs for performance to tell the compiler to optimize instead of adding correctness checks
- use the default configuration rather than the quickstart configuration
Note
You can show the current configuration with printchplenv or chpl --print-chpl-settings
- In particular, CHPL_TASKS=qthreads is generally faster than CHPL_TASKS=fifo, and CHPL_TARGET_MEM=jemalloc is generally faster than CHPL_TARGET_MEM=cstdlib (see also CHPL_TASKS and CHPL_TARGET_MEM)
- CHPL_TARGET_COMPILER=llvm might or might not be faster than using the C backend with something like CHPL_TARGET_COMPILER=gnu; however, the performance of the LLVM backend is more reliable (see also CHPL_*_COMPILER)
- if you are using CHPL_LLVM=system, it’s a good idea to match the version of LLVM bundled in the Chapel release if possible, as this version has received the most attention and testing (see also CHPL_LLVM)
- For multi-locale programs, use a high-performance networking configuration
The following configurations support portability and correctness testing but are NOT designed for high performance:
- CHPL_COMM=gasnet with CHPL_COMM_SUBSTRATE=udp
- CHPL_COMM=gasnet with CHPL_COMM_SUBSTRATE=mpi
High-performance networking configuration will depend on your system:
- For HPE Cray EX, use CHPL_COMM=ofi (see also Using Chapel on HPE Cray Systems)
- For InfiniBand systems, use CHPL_COMM=gasnet with CHPL_COMM_SUBSTRATE=ibv (see also Using Chapel with InfiniBand)
- For Omni-Path systems, use CHPL_COMM=gasnet and CHPL_COMM_SUBSTRATE=ofi (see also Using Chapel with Omni-Path)
- If you are not using multiple locales, set CHPL_COMM=none or compile your application with --local for the best performance (see Wide Pointer Overhead).
Settings to Adjust to Improve Performance¶
This section contains some easy things to try in order to improve performance.
- --fast
If you haven’t been using --fast yet, please do! It should be used when measuring performance. Since it disables bounds checking, make sure that your development flow includes correctness tests that aren’t compiled with --fast.
- --no-ieee-float / --ieee-float
By default, only floating point optimizations that are relatively benign are enabled. Depending on your application, you might use --no-ieee-float to enable optimizations that might impact numerical accuracy. Or, if your program relies on floating point operations happening in the order written for numerical accuracy, you should use --ieee-float.
- colocales
In some settings, running with multiple colocales per node can improve performance. For example, to run on 8 nodes with 2 processes per node, you could use -nl 8x2. That will result in 16 locales, where each node has 2 locales. Using colocales can help with memory bandwidth on NUMA systems and can also better use the networking resources on a node to make communication more efficient.
- --auto-aggregation
This compiler flag enables an optimization that automatically uses CopyAggregation to improve multilocale performance. It is not on by default because it can slow down some applications. In particular, this optimization can be beneficial if your application has small forall loops that do a lot of random remote accesses, but if the forall loops do mostly local accesses, it can hurt performance. See also Fine-Grained Communication, which discusses the problem that aggregators solve.
- --no-cache-remote / --cache-remote
The cache for remote data is a runtime component that helps to reduce fine-grained communication. It is enabled by default, but in some cases an application will run faster with it disabled. It is also usually a good idea to disable it when investigating the sources of communication, as the communication logs are simpler when it is disabled.
- CHPL_TARGET_CPU
Using native or the CPU family that you are targeting, rather than none or unknown, can allow using newer instruction sets (e.g. AVX512) and improve performance.
Fundamental Issues¶
This section covers issues that are fundamental to the Chapel programming model. As a result, people optimizing Chapel programs should be aware of them.
Accidental Communication¶
The Chapel programming model supports implicit communication in order to make it easier to write distributed-memory programs. While those first distributed-memory programs are easier to write, they might include accidental communication. Accidental communication can be a big barrier to scalability because it often amounts to repeatedly accessing the same memory on one node.
Addressing accidental communication consists of two parts. First, the accidental communication needs to be identified. Second, the code needs to be modified to avoid the accidental communication.
Here are a few strategies to identify accidental communication:
- Use local blocks (see also The ‘local’ keyword); a small sketch follows this list. local blocks are an unstable feature that instructs the compiler that there should be no communication within the code in that block, including in functions called from within the local block. When the program is compiled with --fast --local-checks (or with the default of full checking), the compiler will emit code to halt if code running in a local block needs to communicate. If you have compiled with CHPL_UNWIND != none, you can even see the stack trace for the code which caused communication you did not expect. local blocks have a secondary advantage of allowing the compiler to optimize: it optimizes assuming that all code in local blocks does not communicate if you compile with --fast. Note that local blocks can slow down compilation significantly if not used sparingly. If compilation time is important, consider using them as a debugging tool and then removing them to improve compilation speed.
- Use CommDiagnostics on-the-fly reporting. The CommDiagnostics module provides mechanisms for on-the-fly reporting with startVerboseComm(). This on-the-fly reporting can even include stack traces if you compile with -scommDiagsStacktrace=true and have built Chapel with CHPL_UNWIND != none. The on-the-fly reporting provides a relatively easy way to see what communication events are common in your program. It can be a lot of output though. This strategy works reasonably well for finding accidental communication that is a performance problem because, if accidental communication is happening in a key performance-critical inner loop, verbose comms reporting will report on that accidental communication many times. With some light scripting to process the verbose comms reporting, you can identify which parts of the code have the most communication.
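As a small sketch of the local block strategy (the array and loop are invented for illustration):

var A: [1..100] int; // a non-distributed, locale-local array

local {
  // with checks enabled, the program halts if anything in this block
  // communicates; with --fast, the compiler optimizes assuming that
  // no communication happens here
  forall i in 1..100 do
    A[i] = i;
}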
Here are some strategies you can use to adjust your code to avoid accidental communication:
- If the value being accidentally communicated can be stored in a variable outside of the distributed loop or if it can be stored in a module-scope variable, storing it in a const variable can enable a key compiler optimization called remote value forwarding. This optimization allows the compiler to move the value of the variable along with the message sent to a remote locale to set up work. However, it only works if the compiler can prove that the value will not change. const helps because it indicates to the compiler that the value won’t change. For example:

{
  var A = blockDist.createArray(...);
  var x = 22;
  forall elt in A {
    elt = x; // uh-oh, x might be read remotely on each iteration!
  }
  const y = 22;
  forall elt in A {
    elt = y; // expect the value of 'y' to be sent along with tasks
  }
}

Note

Actually, in this specific example, x will be copied to each task implementing the first forall loop due to Forall Intents. The unnecessary communication described could occur if x were a record.

- If the accidental communication is within a distributed forall loop, you can change it from being once per iteration to once per task by using the in intent. For example:

var A = blockDist.createArray(...);
var x = 22;
forall elt in A {
  elt = x; // uh-oh, x might be read remotely on each iteration!
}
forall elt in A with (in x) {
  elt = x; // ah, now x is only read once per task, at least
}

- If the code is using a structure like coforall loc in Locales, you can create a temporary local variable to store a local copy of the variable. For example:

var x = [1,2,3,4];
coforall loc in Locales {
  on loc {
    var myX = x;
    f(myX); // do something with myX
  }
}
Wide Pointer Overhead¶
The Chapel compiler will emit code working with pointers in many cases (for array elements, references, class instances, …). When the Chapel compiler is compiling for multiple locales and it is unable to prove that a pointer is local, it will emit a wide pointer. The wide pointer encodes an address along with a locale where the value is stored. In cases where the code is working with local memory but the compiler can’t prove that, there will be additional overhead due to the code working with a wide pointer.
It is relatively easy to detect if this is a performance problem for a
Chapel program because it has a pretty clear symptom that you can see by
compiling two different ways and running on 1 locale. Measure the
performance of your program compiled with CHPL_COMM=none
and/or
--local
. Compare that performance with the performance of your
program with CHPL_COMM
other than none
and/or with
--no-local
. You are seeing wide pointer overhead if the --local
version is significantly faster than the --no-local
version.
What can be done about it?
- use local blocks to tell the compiler that code within a block will not communicate. That allows it to remove wide pointers for that code. See also Accidental Communication.
- use localSubdomain to compute the local index set as a local domain
- use localSlice to compute a non-distributed array slice that refers to the local portion
- compile with --report-auto-local-access to see which local array accesses are optimized by the compiler. Use localAccess for accesses that aren’t optimized (a sketch follows the note below).

Note
A bit more about the Automatic Local Access (ALA) optimization
The Chapel compiler can optimize distributed array accesses inside forall loops if it can determine that they will always be local. Here are some cases that it can optimize:

var A, B: [D] int;
var OtherDomain = ...

forall i in A.domain { ... A[i] ... }
forall i in B.domain { ... A[i] ... }
forall i in D { ... A[i] ... }
forall i in D.expand(-1) { ... A[i] ... }
forall i in OtherDomain { ... A[i] ... } // with some execution time checks

This optimization can also be effective when Stencil-distributed arrays are used:

use StencilDist;

var A = stencilDist.createArray({0..<n}, fluff=(1,), int);

forall i in A.domain { ... A[i-1] ... A[i] ... A[i+1] ... }

The following compiler flags can be used to investigate and modify ALA behavior:

- --report-auto-local-access: Generates a report showing which accesses are optimized and which are not, with some explanation as to what caused the final decision for the optimization.
- --no-auto-local-access: Disables the optimization. We are not aware of a case where ALA hurts performance. There could be a slight improvement in compilation time from disabling the optimization, but we haven’t observed it to have a significant impact.
- --no-dynamic-auto-local-access: Disables the dynamic portion of the optimization. In some cases, the compiler can introduce loop clones, where the executed clone will be chosen at execution time depending on some dynamic checks. This flag turns that behavior off, so that the optimization applies only when it can be proven statically.
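Here is a hedged sketch of the localSubdomain / localAccess approach from the list above; the array, its distribution, and the computation are invented for illustration:

use BlockDist;

config const n = 1000;
const D = blockDist.createDomain({1..n});
var A: [D] real;

coforall loc in Locales do on loc {
  // each locale iterates over only the indices it owns
  const localD = A.localSubdomain();
  for i in localD do
    // localAccess skips the locality handling that a plain A[i] needs
    A.localAccess[i] = i:real;
}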
Fine-Grained Communication¶
It’s easy to write Chapel programs that use fine-grained communication. Such programs will work with many small messages and the performance will be sensitive to the latency of a message in the network. To make such programs perform and scale better, try to get the communication to use larger messages so that the network is operating more in a bandwidth-bound way rather than a latency-bound way.
Let’s look at an example of permuting an array. Suppose that we have three distributed arrays, all over the distributed domain D:

- In stores input values
- Idx stores a permutation; each index in D occurs exactly once in Idx.
- Out stores the result of permuting In according to Idx

In particular, we will set Out[Idx[i]] = In[i]; for each i in D.
If we write that in the simplest way, we get this code:
forall i in D {
Out[Idx[i]] = In[i];
}
However that code will do fine-grained PUT operations on the network;
Out[Idx[i]]
will refer to the array element to store into, and since
that array element is typically stored on a different locale, the
compiler will generate a PUT operation for Out[Idx[i]] = In[i]
.
We can optimize this by using the CopyAggregation
module.
Here is the optimized version:
forall i in D with (var agg = new DstAggregator(int)) {
agg.copy(Out[Idx[i]], In[i]);
}
In this case, the optimization required two changes to the program:
- Create a copy aggregator per task. Here, we use the forall loop’s with clause (a forall intent) to create a DstAggregator per task. The DstAggregator will optimize the PUT operations by combining PUT operations destined for the same locale.
- Use the copy aggregator instead of plain old assignment. The call agg.copy(X, Y) is functionally similar to X = Y.
How much does optimizing this example improve performance? With a quick
test on 16 nodes, we can see (through use of the
CommDiagnostics
module) that the number of PUTs is
approximately 1000 times less with the optimized version. By timing the
relevant loop, we can see that the performance is approximately 30 times
better. In this simple case, with the --auto-aggregation
flag, the
compiler can even make these adjustments automatically.
Also note that the CopyAggregation module used here is most applicable when working with random access. If you are copying contiguous regions of arrays, an optimized slice assignment can also be a good way to perform bulk communication. See Creating Too Many Distributed Objects / Unoptimized Slice Assignments for some important caveats.
Load Imbalance¶
The term load imbalance describes a situation where a parallel program won’t run as fast as it could because the parallel work is not evenly divided up among the hardware elements doing the work. As a contrived example of load imbalance, suppose you have 4 locales, and locale 0 does 4 times as much work as the other locales. Locales 1, 2, and 3 will spend most of their time waiting for locale 0, which is not ideal.
Load imbalance can be a particularly problematic issue when working with
a parallel computation that has different phases. For Chapel programs,
these phases are probably represented by different parallel loops or with
barriers. If one task in a parallel region (such as a forall
or
coforall
loop) takes a significantly longer time than the other
tasks, the time the loop takes will be this long time, since the program
cannot proceed with the code after the loop until all of the tasks have
completed.
How can load imbalance be identified?
- Within a parallel region, different tasks are taking very different amounts of time
- When profiling, the program spends a lot of time waiting for tasks to complete or waiting for barriers
How can load imbalance be addressed?
There are often ways to improve the algorithm to address load imbalance. For example, graph partitioners are an important technology that can help balance the storage of a data structure and the computation that goes with it. More generally, you might be able to rearrange the data and the computation so that, with the new arrangement, there is better load balance.
In some cases, DynamicIters might help: you can use it to write parallel loops that dynamically balance the load across tasks, as in the sketch below.
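For example, here is a hedged sketch using the dynamic iterator from DynamicIters; the work function and chunk size are invented for illustration:

use DynamicIters;

config const n = 10000;
var results: [1..n] real;

proc variableWork(i: int): real {
  // placeholder whose cost varies with i, creating imbalance
  // under a static division of iterations
  var sum = 0.0;
  for j in 1..(i % 1000) do
    sum += j;
  return sum;
}

// dynamic() hands out chunks of the range on demand, so tasks that
// finish cheap chunks early simply come back for more work
forall i in dynamic(1..n, chunkSize=16) do
  results[i] = variableWork(i);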
Current Issues¶
This section contains issues that, ideally, the Chapel compiler and runtime would address. However, as they may come up in practice, it’s important to be aware of them and their workarounds.
Distributed Array Field Access Can Result in Unnecessary Communication¶
Issue: https://github.com/chapel-lang/chapel/issues/10160
When a class contains a field that is a distributed array, a distributed
forall
loop using a class instance will generate communication in
order to read the field’s value (repeatedly) even though the field’s
value should be a privatized distributed array.
This issue can be avoided by creating a ref
or const ref
that
refers to the distributed array. This ref
or const ref
can be
created outside of the forall
loop and reused within it to avoid the
problem.
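A hedged sketch of this workaround follows; the class and its fields are invented for illustration:

use BlockDist;

class Container {
  var D = blockDist.createDomain({1..1000});
  var A: [D] int;
}

var c = new Container();

// bind the distributed-array field to a ref once, outside the forall,
// so the loop does not repeatedly communicate to read the field
ref A = c.A;
forall i in A.domain do
  A[i] = i;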
Creating Too Many Distributed Objects / Unoptimized Slice Assignments¶
Issue: https://github.com/chapel-lang/chapel/issues/16133
In principle, distributed array creation does some work on each locale. As a result, it’s not going to go faster when adding more locales. That can cause performance or scaling issues if a program tries to create too many distributed arrays.
One aspect of the current implementation is that array slices of
distributed arrays are also distributed objects and work is required on
each locale to create one. That means that creating a slice, as in
MyDistributedArray[1..10]
can actually be quite slow.
One common scenario is that data needs to be copied between regions of
arrays. For example
MyDistributedArray[1..10] = MyOtherArray[11..20];
.
The compiler can optimize this kind of
assignment in many cases today with Array View Elision (AVE). However, in
order to optimize it in this way, both sides of the =
need to be
slice expressions. Note that in some cases, one might create a local
array to store the contents of a remote slice; that can be done with AVE
by redundantly slicing the local array, e.g.:
var Loc: [1..10] int;
Loc[1..10] = MyDistributedArray[1..10];
Unoptimized Distributions¶
Issue: https://github.com/chapel-lang/chapel/issues/27334
The Chapel programming language is designed to support many distributions
for domains and arrays. That includes distributions created by users.
However, the present situation is that performance of Chapel programs
depends on optimizations in the implementations of these domain/array
distributions. These optimizations are present and reasonably well tuned
for the BlockDist
distribution. Other distributions might not
be optimized and have scaling issues.
Performance Problems with Multidimensional Zippered Iteration¶
Issue: https://github.com/chapel-lang/chapel/issues/13147
Zippered iteration for multidimensional arrays/domains is much slower
than zippered iteration for 1D arrays/domains. Since promoted calls, such
as MyArray + MyOtherArray
are implemented with zippered iteration,
this problem also applies to that case.
Potential ways to avoid this problem:
- use loops over a multidimensional domain that avoid zippering (see the sketch after this list)
- express the computation with nested loops per dimension
- use 1D arrays and explicitly compute 1D indices from 2D conceptual indices
- create a 1D copy of the array (with reshape – note that in the future, we expect to have a way to reshape without copying)
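As a hedged sketch of the first two workarounds (the arrays and sizes are invented):

config const n = 1000;
var A, B: [1..n, 1..n] real;

// slow: multidimensional zippered iteration
forall (a, b) in zip(A, B) do
  a = b;

// often faster: iterate over the domain and index directly
forall (i, j) in A.domain do
  A[i, j] = B[i, j];

// or: nested loops, one per dimension
forall i in 1..n do
  for j in 1..n do
    A[i, j] = B[i, j];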
Cooperative Scheduling and Remote Tasks¶
Issue: https://github.com/chapel-lang/chapel/issues/27332
Chapel’s tasking model currently uses cooperative scheduling. That means that, once a task starts running on a core, it will be the only thing running on that core until the task either yields execution or ends.
Occasionally, this can lead to problems when combined with on
statements. The on
statements generally create a task on a remote
locale. Those remote tasks will never get a chance to run if all of the
cores are busy with existing tasks. This problem is rare, but it can
cause performance issues if it comes up. The solution is to
periodically call currentTask.yieldExecution()
from any polling loops. This is already done in
waitFor()
methods on atomics (see also
Atomic Variables).
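As a hedged sketch using the call named above (the flag and the polled work are invented), a polling loop that yields might look like:

var done: atomic bool;

begin {
  // placeholder for work running as another task
  done.write(true);
}

while !done.read() {
  // yielding lets other tasks (including ones launched by remote
  // 'on' statements) get a chance to run on this core
  currentTask.yieldExecution();
}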
Tools for Understanding Performance¶
Note
This section is under construction. Contributions are welcome!
Tools for Understanding Communication¶
Several tools are available to help optimize communication in order to improve the distributed-memory scalability of your program.
- local blocks
local blocks are an unstable feature that instructs the compiler to assume that there will be no communication in the body of the local block. Moreover, if compiling with --checks (generally the case without --fast), the program will halt if code in a local block requires communication. As a result, local blocks are useful both as an optimization tool and as a way of discovering unintended communication.
- CommDiagnostics on-the-fly reporting
The CommDiagnostics module provides mechanisms for on-the-fly reporting with startVerboseComm(). This on-the-fly reporting can even include stack traces if you compile with -scommDiagsStacktrace=true and have built Chapel with CHPL_UNWIND != none (see CHPL_UNWIND). The on-the-fly reporting provides a relatively easy way to see what communication events are common in your program. It can be a lot of output, though. A sketch of scoping the reporting to a region of interest appears after this list.
- CommDiagnostics comm counting
CommDiagnostics also provides a way to count communication events. Note that the communication count information will be easier to understand if you compile with --no-cache-remote since that will reduce the number of categories of events. Comm count information provides a way to compare communication performance independent of where you are running. For example, you might measure and seek to reduce the communication counts when running oversubscribed on a laptop or workstation. Reductions in communication counts from such an effort should also help with multi-node performance when you are able to run on a big system.
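For instance, a minimal sketch of scoping the on-the-fly reporting to a region of interest (the region itself is a placeholder):

use CommDiagnostics;

startVerboseComm();

// ... the code being investigated for unexpected communication ...

stopVerboseComm();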