Using Chapel with ugni

Chapel supports a Cray-specific ugni communication layer for targeting the Aries network on Cray-XC’s. The ugni communication layer interacts with the system’s network interface very closely through a lightweight interface called uGNI (user Generic Network Interface). On Cray XC systems the ugni communication layer is the default.

Using the ugni Communications Layer

To use ugni communications:

  1. Leave your CHPL_COMM environment variable unset or set it to ugni:

    export CHPL_COMM=ugni
    

    This specifies that you wish to use the Cray-specific communication layer.

  2. (Optional) Load an appropriate craype-hugepages module. For example:

    module load craype-hugepages16M
    

    The ugni communication layer can be used with or without so-called hugepages. Performance for remote variable references is much better when hugepages are used. The only downside of using hugepages is that the tasking layer may not be able to detect task stack overflows by means of guard pages, see below for more information.

    To use hugepages, you must have a craype-hugepages module loaded both when building your program and when running it. There are several hugepage modules, with suffixes indicating the page size they support. For example, craype-hugepages16M supports 16 MiB hugepages. It does not matter which craype-hugepages module you have loaded when you build your program. Any of them will do. Which one you have loaded when you run a program does matter, however. For general use, the Chapel group recommends the craype-hugepages16M module. You can read on for more information about hugepage modules if you would like, but the recommended craype-hugepages16M module will probably give you satisfactory results.

    The Cray network interface chips (NICs) can only address memory that has been registered with them. In practical terms, the Aries(TM) NIC on Cray XC systems is not limited as to how much memory it can register. However, it does have an on-board cache of 512 registered page table entries, and registering more than this can cause reduced performance if the program’s memory reference pattern causes refills in this cache. We have seen up to a 15% reduction from typical nightly XC-16 performance in an ra-rmo run using hugepages small enough that every reference should have missed in this cache. Covering an entire 128 GiB XC compute node with only 512 hugepages will require at least the craype-hugepages256M module’s 256 MiB hugepages.

    Offsetting this, using larger hugepages may reduce performance because it can result in poorer NUMA affinity. With the ugni communication layer, arrays larger than 2 hugepages are allocated separately from the heap, which improves NUMA affinity. An obvious side effect of using larger hugepages is that an array has to be larger to qualify. Thus, achieving the best performance for any given program may require striking a balance between using larger hugepages to reduce NIC page table cache refills and using smaller ones to improve NUMA locality.

    Note that when hugepages are used with the ugni comm layer, tasking layers cannot use guard pages for stack overflow detection. Qthreads tasking cannot detect stack overflow except by means of guard pages, so if ugni communications is combined with qthreads tasking and a hugepage module is loaded, stack overflow detection is unavailable.

Network Atomics

The Aries networks on Cray XC series systems support remote atomic memory operations (AMOs). When the CHPL_NETWORK_ATOMICS environment variable is set to ugni, the following operations on remote atomics are done using the network:

  • 32- and 64-bit signed and unsigned integer types and real types:

    • read()

    • write()

    • exchange()

    • compareAndSwap()

    • add(), fetchAdd()

    • sub(), fetchSub()

  • 32- and 64-bit signed and unsigned integer types:

    • or(), fetchOr()

    • and(), fetchAnd()

    • xor(), fetchXor()

All of the operations shown above are done natively by the network hardware except 64-bit real add, which is disabled in hardware and thus done using on statements.

ugni Communication Layer and the Heap

The “heap” is an area of memory used for dynamic allocation of everything from user data to internal management data structures. When running on Cray XC systems using the default configuration with the ugni comm layer and a craype-hugepages module loaded, the heap is used for all dynamic allocations except data space for arrays larger than 2 hugepages. (See Using the ugni Communications Layer, just above, for more about hugepages.) It is normally extended dynamically, as needed. But if desired, the heap can instead be created at a specified fixed size at the beginning of execution. In some cases this will reduce certain internal comm layer overheads and marginally improve performance.

The disadvantage of a fixed heap is that it usually produces worse NUMA affinity, it limits available heap memory to the specified fixed size, and it limits memory for arrays to whatever remains after the fixed-size heap is created. If either of the latter are less than what a program needs, it will terminate prematurely with an “Out of memory” message.

To specify a fixed heap, set the CHPL_RT_MAX_HEAP_SIZE environment variable to indicate its size. For the value of this variable you can use any of the following formats, where num is a positive integer number:

Format

Resulting Heap Size

num

num bytes

num[kK]

num * 2**10 bytes

num[mM]

num * 2**20 bytes

num[gG]

num * 2**30 bytes

num%

percentage of compute node physical memory

Any of the following would specify an approximately 1 GiB heap on a 128-GiB compute node, for example:

export CHPL_RT_MAX_HEAP_SIZE=1073741824
export CHPL_RT_MAX_HEAP_SIZE=1048576k
export CHPL_RT_MAX_HEAP_SIZE=1024m
export CHPL_RT_MAX_HEAP_SIZE=1g
export CHPL_RT_MAX_HEAP_SIZE=1% # 1.28 GiB, really

Note that the resulting heap size may get rounded up to match the page alignment. How much this will add, if any, depends on the hugepage size in any craype-hugepages module you have loaded at the time you execute the program. It may also be reduced, if some resource limitation prevents making the heap as large as requested.

ugni Communication Layer Registered Memory Regions

The ugni communication layer maintains information about every memory region it registers with Aries NIC. Roughly speaking there are a few memory regions for each tasking layer thread, plus one for each array larger than 2 hugepages allocated and registered separately from the heap. By default the comm layer can handle up to 16k (2**14) total memory regions, which is plenty under normal circumstances. In the event a program needs more than this, a message like the following will be printed:

warning: no more registered memory region table entries (max is 16384).
         Change using CHPL_RT_COMM_UGNI_MAX_MEM_REGIONS.

To provide for more registered regions, set the CHPL_RT_COMM_UGNI_MAX_MEM_REGIONS environment variable to a number indicating how many you want to allow. For example:

export CHPL_RT_COMM_UGNI_MAX_MEM_REGIONS=30000

Note that there are certain comm layer overheads that are proportional to the number of registered memory regions, so allowing a very high number of them may lead to reduced performance.