Using Chapel with ugni
Chapel supports a Cray-specific ugni communication layer for targeting the Aries network on Cray XC systems. The ugni communication layer interacts closely with the system’s network interface through a lightweight interface called uGNI (user Generic Network Interface). On Cray XC systems the ugni communication layer is the default.
Using the ugni Communications Layer
To use ugni communications:
Leave your CHPL_COMM environment variable unset, or set it to ugni:
export CHPL_COMM=ugni
This specifies that you wish to use the Cray-specific communication layer.
(Optional) Load an appropriate craype-hugepages module. For example:
module load craype-hugepages16M
The ugni communication layer can be used with or without so-called hugepages. Performance for remote variable references is much better when hugepages are used. The only downside of using hugepages is that the tasking layer may not be able to detect task stack overflows by means of guard pages; see below for more information.
To use hugepages, you must have a craype-hugepages module loaded both when building your program and when running it. There are several hugepage modules, with suffixes indicating the page size they support. For example, craype-hugepages16M supports 16 MiB hugepages. It does not matter which craype-hugepages module you have loaded when you build your program. Any of them will do. Which one you have loaded when you run a program does matter, however. For general use, the Chapel group recommends the craype-hugepages16M module. You can read on for more information about hugepage modules if you would like, but the recommended craype-hugepages16M module will probably give you satisfactory results.
The Cray network interface chips (NICs) can only address memory that has been registered with them. In practical terms, the Aries(TM) NIC on Cray XC systems is not limited as to how much memory it can register. However, it does have an on-board cache of 512 registered page table entries, and registering more than this can cause reduced performance if the program’s memory reference pattern causes refills in this cache. We have seen up to a 15% reduction from typical nightly XC-16 performance in an ra-rmo run using hugepages small enough that every reference should have missed in this cache. Covering an entire 128 GiB XC compute node with only 512 hugepages will require at least the craype-hugepages256M module’s 256 MiB hugepages.
Offsetting this, using larger hugepages may reduce performance because it can result in poorer NUMA affinity. With the ugni communication layer, arrays larger than 2 hugepages are allocated separately from the heap, which improves NUMA affinity. An obvious side effect of using larger hugepages is that an array has to be larger to qualify. Thus, achieving the best performance for any given program may require striking a balance between using larger hugepages to reduce NIC page table cache refills and using smaller ones to improve NUMA locality.
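The arithmetic behind this trade-off can be sketched in a few lines of shell. This is only an illustration of the two figures quoted above (the 512-entry NIC cache and the "larger than 2 hugepages" threshold); the function names are hypothetical and are not part of Chapel or the Cray tools.

```shell
# Smallest hugepage size (in MiB) such that a given number of cached page
# table entries can cover all of a node's memory.
#   $1  node memory in GiB
#   $2  number of cached page table entries (512 on Aries, per the text)
min_hugepage_mib() {
    mem_mib=$(( $1 * 1024 ))
    # Round up so that entries * page size >= node memory.
    echo $(( (mem_mib + $2 - 1) / $2 ))
}

# Array size (in MiB) above which an array is allocated separately from
# the heap, i.e. "larger than 2 hugepages".
#   $1  hugepage size in MiB
separate_alloc_threshold_mib() {
    echo $(( 2 * $1 ))
}

min_hugepage_mib 128 512             # 128 GiB node, 512 entries
separate_alloc_threshold_mib 16      # with craype-hugepages16M
separate_alloc_threshold_mib 256     # with craype-hugepages256M
```

For a 128 GiB node the first call yields 256, matching the craype-hugepages256M figure above. The other two calls show the cost: the separate-allocation threshold rises from 32 MiB with 16 MiB hugepages to 512 MiB with 256 MiB hugepages.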
Note that when hugepages are used with the ugni comm layer, tasking layers cannot use guard pages for stack overflow detection. Qthreads tasking cannot detect stack overflow except by means of guard pages, so if ugni communications is combined with qthreads tasking and a hugepage module is loaded, stack overflow detection is unavailable.
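Because both performance and stack overflow detection depend on whether a hugepage module is loaded, a launch script may want to check for one before running. The following is a minimal sketch, assuming your module system exports the loaded module list in the colon-separated LOADEDMODULES environment variable, as Environment Modules does; the function name find_hugepage_module is hypothetical.

```shell
# Print the first craype-hugepages module found in a colon-separated
# module list (the LOADEDMODULES format). Returns non-zero if none is found.
find_hugepage_module() {
    old_ifs=$IFS
    IFS=:
    for m in $1; do
        case $m in
            craype-hugepages*)
                IFS=$old_ifs
                echo "$m"
                return 0
                ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

if hp=$(find_hugepage_module "${LOADEDMODULES:-}"); then
    echo "hugepages in use: $hp (no guard-page stack overflow detection)"
else
    echo "no craype-hugepages module loaded"
fi
```

Running the same check in both the build and the run environment helps confirm that a hugepage module is present in each, as required above.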
Network Atomics
The Aries networks on Cray XC series systems support remote atomic memory operations (AMOs). When the CHPL_NETWORK_ATOMICS environment variable is set to ugni, the following operations on remote atomics are done using the network:
32- and 64-bit signed and unsigned integer types and real types:
    read()
    write()
    exchange()
    compareAndSwap()
    add(), fetchAdd()
    sub(), fetchSub()
32- and 64-bit signed and unsigned integer types:
    or(), fetchOr()
    and(), fetchAnd()
    xor(), fetchXor()
All of the operations shown above are done natively by the network hardware except 64-bit real add, which is disabled in hardware and thus done using on statements.
ugni Communication Layer and the Heap
The “heap” is an area of memory used for dynamic allocation of everything from user data to internal management data structures. When running on Cray XC systems using the default configuration with the ugni comm layer and a craype-hugepages module loaded, the heap is used for all dynamic allocations except data space for arrays larger than 2 hugepages. (See Using the ugni Communications Layer, above, for more about hugepages.) The heap is normally extended dynamically, as needed. But if desired, it can instead be created at a specified fixed size at the beginning of execution. In some cases this will reduce certain internal comm layer overheads and marginally improve performance.
The disadvantages of a fixed heap are that it usually produces worse NUMA affinity, it limits available heap memory to the specified fixed size, and it limits memory for arrays to whatever remains after the fixed-size heap is created. If either of the latter two is less than what a program needs, the program will terminate prematurely with an “Out of memory” message.
To specify a fixed heap, set the CHPL_RT_MAX_HEAP_SIZE environment variable to indicate its size. For the value of this variable you can use any of the following formats, where num is a positive integer:
Format     Resulting Heap Size
num        num bytes
num[kK]    num * 2**10 bytes
num[mM]    num * 2**20 bytes
num[gG]    num * 2**30 bytes
num%       percentage of compute node physical memory
Any of the following would specify an approximately 1 GiB heap on a 128-GiB compute node, for example:
export CHPL_RT_MAX_HEAP_SIZE=1073741824
export CHPL_RT_MAX_HEAP_SIZE=1048576k
export CHPL_RT_MAX_HEAP_SIZE=1024m
export CHPL_RT_MAX_HEAP_SIZE=1g
export CHPL_RT_MAX_HEAP_SIZE=1%  # 1.28 GiB, really
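To see why these settings are roughly equivalent, the formats in the table above can be converted to bytes by hand. The sketch below does so in shell. It is an illustration of the documented formats only, not the Chapel runtime's actual parser; the function name heap_size_bytes is hypothetical, it does no input validation, and the integer rounding of the percentage form is an assumption.

```shell
# Convert a CHPL_RT_MAX_HEAP_SIZE-style value to a byte count.
#   $1  the value, e.g. 1073741824, 1048576k, 1024m, 1g, or 1%
#   $2  compute node physical memory in bytes (used only for the % form)
heap_size_bytes() {
    case $1 in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *%)    echo $(( $2 * ${1%?} / 100 )) ;;   # integer division truncates
        *)     echo "$1" ;;
    esac
}

node_mem=$(( 128 * 1024 * 1024 * 1024 ))   # 128 GiB node, as in the example
for v in 1073741824 1048576k 1024m 1g 1%; do
    echo "$v -> $(heap_size_bytes "$v" "$node_mem") bytes"
done
```

The first four values all come out to 1073741824 bytes (1 GiB). The percentage form gives 1374389534 bytes, which is the 1.28 GiB noted in the comment above.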
Note that the resulting heap size may get rounded up to match the page alignment. How much this will add, if anything, depends on the hugepage size in any craype-hugepages module you have loaded at the time you execute the program. The heap size may also be reduced if some resource limitation prevents making the heap as large as requested.
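As a rough illustration of the rounding, the sketch below rounds a requested size up to the next whole multiple of the hugepage size. This assumes the alignment is to a whole number of hugepages; the runtime's exact policy is not spelled out here, and the function name round_up_to_page is hypothetical.

```shell
# Round a byte count up to the next multiple of the page size.
#   $1  requested size in bytes
#   $2  page size in bytes
round_up_to_page() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

page=$(( 16 * 1024 * 1024 ))           # 16 MiB hugepages (craype-hugepages16M)
round_up_to_page 1374389534 "$page"    # 1% of a 128 GiB node: gets rounded up
round_up_to_page 1073741824 "$page"    # exactly 1 GiB: already aligned
```

Under this assumption the 1% request grows to 82 hugepages (1375731712 bytes), while the 1 GiB request is unchanged because it is already 64 whole hugepages.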
ugni Communication Layer Registered Memory Regions
The ugni communication layer maintains information about every memory region it registers with the Aries NIC. Roughly speaking, there are a few memory regions for each tasking layer thread, plus one for each array larger than 2 hugepages that is allocated and registered separately from the heap. By default the comm layer can handle up to 16k (2**14) total memory regions, which is plenty under normal circumstances. In the event a program needs more than this, a message like the following will be printed:
warning: no more registered memory region table entries (max is 16384). Change using CHPL_RT_COMM_UGNI_MAX_MEM_REGIONS.
To provide for more registered regions, set the CHPL_RT_COMM_UGNI_MAX_MEM_REGIONS environment variable to a number indicating how many you want to allow. For example:
export CHPL_RT_COMM_UGNI_MAX_MEM_REGIONS=30000
Note that there are certain comm layer overheads that are proportional to the number of registered memory regions, so allowing a very high number of them may lead to reduced performance.
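To judge whether a program is likely to exceed the default, the rough accounting described above can be written down directly. The sketch below is only an estimate: the text says "a few" regions per thread, so the per-thread count is an assumed parameter, and the function name estimate_mem_regions is hypothetical.

```shell
# Rough estimate of registered memory regions.
#   $1  number of tasking layer threads
#   $2  regions per thread ("a few"; the exact number is an assumption)
#   $3  number of arrays larger than 2 hugepages
estimate_mem_regions() {
    echo $(( $1 * $2 + $3 ))
}

default_max=16384    # 2**14, the default limit quoted above
needed=$(estimate_mem_regions 64 3 20000)
if [ "$needed" -gt "$default_max" ]; then
    echo "about $needed regions needed; consider raising CHPL_RT_COMM_UGNI_MAX_MEM_REGIONS"
fi
```

Since overheads grow with the limit, it seems better to set it only modestly above such an estimate than to pick an arbitrarily large value.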