Using Chapel with libfabric¶
This document describes how to run Chapel across multiple compute nodes using
the OpenFabrics Interfaces libfabric-based ofi
communication layer.
Multilocale Chapel Execution gives general information about running Chapel
in a multilocale configuration.
Building Chapel with the ofi Communication Layer¶
1. Make general, non-communication Chapel configuration settings as described in Setting up Your Environment for Chapel.
2. Configure the Chapel runtime to select the ofi communication layer:
export CHPL_COMM=ofi
3. Set the CHPL_LIBFABRIC environment variable to indicate whether to use a system-installed libfabric or the bundled one. Current options are:

Value     Description
system    use a system install of libfabric
bundled   use the libfabric bundled with Chapel in third-party
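For example, to use a system-installed libfabric:
export CHPL_LIBFABRIC=system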
If unset, Chapel will attempt to find a libfabric installation using the command
pkg-config --exists libfabric
with the specific paths involved filled in using other pkg-config options. Note that on HPE Cray EX systems in particular, we expect a system-installed libfabric to always be used; if the environment variable is set to ‘bundled’ on such a system, or the default search fails to find a libfabric there, a warning will result.
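If the default search fails and you believe a system libfabric is present, you can run the same pkg-config queries by hand as a diagnostic; these are standard pkg-config options, shown here only as an illustration:
pkg-config --exists libfabric && echo "libfabric found"
pkg-config --cflags --libs libfabric    # show the include and link flags pkg-config would supply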
4. Select a launcher. On Cray XC and HPE Cray EX systems you can skip this step, because on those systems the automatically selected aprun or srun launcher settings will work with the ofi communication layer. On other systems, select the appropriate launcher. If using a slurm-based cluster, use the Chapel slurm-srun launcher:
export CHPL_LAUNCHER=slurm-srun
Otherwise, use the Chapel mpirun4ofi launcher. For more information see The mpirun4ofi Launcher.
export CHPL_LAUNCHER=mpirun4ofi
5. If you are using the mpirun4ofi launcher, set the MPI_DIR variable indicating the path to an OpenMPI installation. It may be possible to skip this step if your system has OpenMPI already installed and your target compiler can find its include and library files itself, but this is not common, so you will probably need to do:
export MPI_DIR=<Place where OpenMPI is installed>
The <Place where OpenMPI is installed> should be a directory with an include subdirectory that contains the OpenMPI include files and a lib subdirectory that contains the OpenMPI library files. If your system does not have those installed already, you will need to download OpenMPI and possibly build it.
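As a concrete illustration (the /opt/openmpi prefix is hypothetical; substitute your actual installation path):
export MPI_DIR=/opt/openmpi
ls $MPI_DIR/include/mpi.h $MPI_DIR/lib    # sanity check: headers and libraries are present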
Note
In the future we hope both to be able to support the MPICH MPI package in addition to the OpenMPI MPI package, and to use the regular Chapel mpirun launcher with the ofi communication layer, but for now OpenMPI and mpirun4ofi are the only options on platforms other than Cray XC and HPE Cray EX systems.
Note: On a Mac OS X system, OpenMPI can be obtained through Homebrew with the following command.
brew install open-mpi
6. If you are not on a Cray XC or HPE Cray EX system and you are not using the mpirun4ofi launcher, you may need to set CHPL_COMM_OFI_OOB.

Value     Description
pmi2      use the PMI2 (Process Management Interface) out-of-band (OOB) mechanism
mpi       use the MPI out-of-band (OOB) mechanism
sockets   use the sockets out-of-band (OOB) mechanism
On Cray XC and HPE Cray EX systems, the default is pmi2. If the launcher is mpirun4ofi, the default is mpi. Otherwise, Chapel will fall back to sockets, which is not recommended. If you are using a slurm-based launcher, you should set CHPL_COMM_OFI_OOB=pmi2.
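For example, on a slurm-based cluster:
export CHPL_COMM_OFI_OOB=pmi2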
7. Re-make the compiler and runtime from CHPL_HOME (see Building Chapel):
cd $CHPL_HOME
make
8. Now you are ready to compile and run programs. Compile your Chapel program as usual:
chpl $CHPL_HOME/examples/hello6-taskpar-dist.chpl
Optionally set any environment variables necessary during execution (see below) and run, specifying the number of locales on the command line. For example, this runs the hello6-taskpar-dist example program on 2 locales:
./hello6-taskpar-dist -nl 2
Execution Environment¶
Libfabric Providers¶
Libfabric defines an abstract network and operations on it, and so-called providers within libfabric define the concrete instances of the network and operations. The provider used by a program is selected at execution time. The ofi communication layer has been tested with 4 different providers:
- gni
The gni provider works only on Cray XC systems. It is built on the Cray native uGNI library and communicates over the Cray proprietary Aries network interface. This is the default provider on Cray XC systems. Note that the libfabric gni provider itself is something of a work-in-progress, and Chapel performance using libfabric and gni will probably never match what can be achieved using the native ugni communication layer.
- tcp
The tcp provider should work on all platforms. It is built on TCP sockets and communicates over any network interface on which the OS can provide sockets support. This is the default provider on all systems other than Cray XC and HPE Cray EX. The tcp provider is fully functional; indeed, to the extent libfabric has a reference provider, the tcp provider is it. Its emphasis, however, is definitely functionality rather than performance.
- efa
The efa provider works on AWS EC2 instances with Elastic Fabric Adapter (EFA) support. This is the default provider on AWS EC2 instances with EFA support.
- verbs
The verbs provider works on any system with verbs-based network hardware (InfiniBand, iWarp, etc.). It is built on the Linux Verbs API. (Note for libfabric devotees: when the verbs provider is specified to the ofi communication layer as described below, what is actually used is the verbs;ofi_rxm provider, which is the verbs provider plus a utility provider which supports reliable datagrams for remote memory access operations.)
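To see which providers your libfabric installation actually offers, you can use libfabric's fi_info utility, if it is installed; this is purely a diagnostic aid, not something Chapel requires:
fi_info -l        # list the available providers
fi_info -p tcp    # show interface details for one provider, e.g. tcp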
The CHPL_RT_COMM_OFI_PROVIDER
or FI_PROVIDER
environment
variables can be set to force use of a provider other than the default,
with the former overriding the latter if both are set. In particular,
this can force use of the tcp provider on Cray XC systems, or the verbs
provider on verbs-based systems where the default would otherwise be the
tcp provider. For example, the following would force use of the verbs
provider:
export CHPL_RT_COMM_OFI_PROVIDER=verbs
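Similarly, libfabric's own variable can be used, though CHPL_RT_COMM_OFI_PROVIDER overrides it if both are set:
export FI_PROVIDER=verbs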
The Chapel group has done full testing both on a Cray XC system with the gni and tcp providers, and on InfiniBand-based Cray CS and HPE Cray EX systems with the tcp and verbs providers. Some additional testing has been done with the tcp provider on a MacBook running Mac OS X. All of these configurations are expected to work. Provider settings we have not tested with the ofi communication layer may lead to internal errors. Settings which are at odds with the available networks, such as specifying the gni provider on a vanilla Linux cluster, will definitely lead to internal errors.
Although the discussion here has revolved around selecting providers as such, in reality the selection is capability-based and precisely which provider ends up getting selected doesn’t particularly matter to the communication layer. There is no expectation that users or programmers should concern themselves with providers. Providers are simply the mechanism of portability when using libfabric.
Memory Registration and the Heap¶
Network technologies sometimes require memory registration, meaning that ranges of memory which will be the source or target of communication operations must be made known to the network before any such operations can occur. When the ofi communication layer is used either with the verbs provider on InfiniBand-based platforms (including HPE Cray EX systems) or with the gni provider on Cray XC systems, memory has to be registered. This has certain implications for users, the most notable being that the heap must have a fixed size.
The heap is an area of memory used for dynamic allocation of everything from user data to task stacks to internal management data structures. When memory must be registered, the communication layer needs to know the maximum size the heap will grow to during execution. By default, the ofi communication layer creates a fixed heap whose size is 85% of compute node physical memory when it predicts that doing so will result in better network performance.
You can adjust this by setting the CHPL_RT_MAX_HEAP_SIZE environment variable. Set it to a positive number for the desired heap size in bytes, optionally followed by k or K for KiB, m or M for MiB, or g or G for GiB, or to a positive integer followed by % to indicate a percentage of the node real memory. Either CHPL_RT_MAX_HEAP_SIZE=12g or =20% specifies roughly a 12 GiB heap on a 64 GiB compute node, for example.
Alternatively, you can prevent creation of a fixed heap entirely by
setting CHPL_RT_MAX_HEAP_SIZE=0
. This may cause the selection of a
different provider than the highest-performing one, however.
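For example, each of the following is a valid setting (they are alternatives, not meant to be combined):
export CHPL_RT_MAX_HEAP_SIZE=12g   # fixed heap of roughly 12 GiB
export CHPL_RT_MAX_HEAP_SIZE=20%   # 20% of compute node physical memory
export CHPL_RT_MAX_HEAP_SIZE=0     # no fixed heap (may change provider selection)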
CHPL_RT_MAX_HEAP_SIZE
sets the heap size per compute node. If the
application is run with multiple locales per node (co-locales), the heap size
will be divided equally among the co-locales. For example, setting
CHPL_RT_MAX_HEAP_SIZE=100G
and running with 4 co-locales (i.e., -nl Nx4
) will
result in a 25 GiB heap per co-locale.
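As a concrete sketch of that example (the 2-node allocation and program name are illustrative):
export CHPL_RT_MAX_HEAP_SIZE=100G
./hello6-taskpar-dist -nl 2x4   # 8 locales total, 4 co-locales per node, ~25 GiB heap each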
Note
In the future we hope to be able to reduce the user impact of memory registration and fixed heaps when using the ofi communication layer.
Hugepages on Cray XC and HPE Cray EX Systems¶
We have not yet quantified the effects, but performance with the gni and
cxi providers may be improved through the use of hugepages. To use
hugepages you must have a craype-hugepages
module loaded and
CHPL_COMM_OFI_USE_HUGEPAGES=true
when you build Chapel and compile
your program, and you must have a craype-hugepages
module loaded
and CHPL_RT_COMM_OFI_USE_HUGEPAGES=true
when you run your program
(note that these are not the same variable).
For example:
module load craype-hugepages16M
export CHPL_COMM_OFI_USE_HUGEPAGES=true
export CHPL_RT_COMM_OFI_USE_HUGEPAGES=true
See Native ugni Communication Layer for more discussion about hugepages,
hugepage modules, and the heap size. Note, however, that anything
there about a dynamically sized heap does not apply to the ofi
communication layer and the libfabric gni and cxi providers. Also note
that the ofi communication layer cannot test the size of a hugepage
fixed heap during initialization; as a result, it will create a heap of
size CHPL_RT_MAX_HEAP_SIZE
even if there is insufficient physical
memory, which may lead to out-of-memory errors during program
execution.
Transparent Hugepages¶
With some providers, the use of transparent hugepages is required to register
more memory. This is the default when using the EFA provider on AWS. To explicitly
enable transparent hugepages, set the CHPL_RT_COMM_OFI_THP_HINT
environment
variable to true. This will enable transparent hugepages for the duration of
the program. Note that this requires the kernel to have transparent hugepages
enabled and enough free hugepages to be available.
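For example:
export CHPL_RT_COMM_OFI_THP_HINT=true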
The mpirun4ofi Launcher¶
Programs built with the ofi communication layer on Cray XC and HPE Cray
EX systems can use the existing launchers. On other systems, for now
they must use the mpirun4ofi
launcher, which is a provisional, thin
wrapper around OpenMPI mpirun
.
The mpirun4ofi launcher can run libfabric-based Chapel programs either with or without slurm. Outside of a slurm job, it will run all of the per-locale Chapel program instances directly on the launch node. In this situation you should be sure to follow the guidance in Overloading system nodes if you are using Qthreads-based tasking. Within a slurm job, the mpirun4ofi launcher will arrange for the per-locale Chapel program instances to be distributed in a cyclic manner across the nodes assigned to the job. Overloading can still be an issue if there are more Chapel locales (program instances) than nodes in the slurm job, however.
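A minimal end-to-end sketch using this launcher (the 4-locale count is illustrative):
export CHPL_COMM=ofi
export CHPL_LAUNCHER=mpirun4ofi
chpl $CHPL_HOME/examples/hello6-taskpar-dist.chpl
./hello6-taskpar-dist -nl 4   # outside a slurm job, all 4 instances run on the launch node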