GPU Programming

Chapel enables developers to use parallelism at different levels: from intra-node multicore parallelism, to cross-node distributed parallelism, to GPUs. This technote serves as a reference on how to use Chapel to program GPUs. Specifically, it gives a quick overview of GPU programming, includes a handful of examples, discusses system requirements and current limitations for GPU support, and delves into more details on some specific GPU-related features.

Readers preferring a more tutorial-like introduction to Chapel’s GPU support may also wish to look at our GPU Programming in Chapel blog series.

Warning

This work is under active development. As such, the interface is unstable and expected to change.

Overview

The Chapel compiler will generate GPU kernels for certain parallel operations such as forall/foreach loops, reduce expressions and promoted expressions. These will be launched onto a GPU when the current locale (i.e. here) is the sublocale representing that particular GPU. To deploy code to a GPU, put the relevant code in an on statement targeting a GPU sublocale (e.g. here.gpus[0]).

Any arrays that are declared by tasks executing on a GPU sublocale will, by default, be accessible on the GPU (see the Memory Strategies subsection for more information about alternate memory strategies).

Chapel will launch kernels for all eligible data-parallel operations that are encountered by tasks executing on a GPU sublocale. Expressions are eligible when:

  • They are order-independent, such as:

    • forall or foreach loops over iterators that are also order-independent (i.e. the yielding loops use foreach instead of for; all Chapel iterators over ranges, domains and arrays are order-independent),

    • reduce expressions over order-independent iterators,

    • promoted expressions over order-independent iterators.

  • They do not call out to extern functions (aside from those in an exempted set of Chapel runtime functions).

  • They do not allocate memory dynamically (i.e. no class instances or Chapel arrays are created within).

  • They are free of any call to a function that fails to meet the above criteria or accesses outer variables.

Any code in an on statement for a GPU sublocale that is not within an eligible loop will be executed on the CPU.
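As a minimal sketch of these rules, the block below mixes eligible and ineligible code inside a single on statement. The array A, its size, and the variable total are illustrative names and are not taken from any specific test.

```chapel
on here.gpus[0] {
  // Declared by a task running on the GPU sublocale, so the array data
  // is accessible on the GPU by default.
  var A: [1..8] int;

  // Order-independent loop over a range: eligible for a kernel launch.
  foreach i in 1..8 do A[i] = i * i;

  // Promoted expression over an array: also eligible.
  A = A + 1;

  // A serial `for` loop is not a data-parallel operation, so it is
  // executed by the CPU even though it sits inside the `on` block.
  var total = 0;
  for a in A do total += a;

  writeln(total);
}
```

Compiling with the --report-gpu flag described under Diagnostics and Utilities is one way to confirm which of these loops the compiler considers eligible.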

Examples

The following example illustrates running a computation on a GPU as well as a CPU. When jacobi is called with a GPU locale, it allocates the arrays A and B in the device memory of that GPU, and the compiler generates three GPU kernels, one for each forall loop in the function.

config const nSteps = 10;
config const n = 10;

writeln("on GPU:");
jacobi(here.gpus[0]);
writeln("on CPU:");
jacobi(here);

proc jacobi(loc) {
  on loc {
    var A, B: [0..n+1] real;

    A[0] = 1; A[n+1] = 1;
    forall i in 1..n { A[i] = i:real; }

    for step in 1..nSteps {
      forall i in 1..n { B[i] = 0.33333 * (A[i-1] + A[i] + A[i+1]); }
      forall i in 1..n { A[i] = 0.33333 * (B[i-1] + B[i] + B[i+1]); }
    }
    writeln(A);
  }
}

For additional examples we suggest looking at some of our internal tests. Note that these are not packaged in the Chapel release but are accessible from our public GitHub repository.

Tests of particular interest include:

Benchmark examples

  • Jacobi – Jacobi example (shown above)

  • Stream – GPU enabled version of Stream benchmark

  • SHOC Triad (Direct) – a transliterated version of the SHOC Triad benchmark

  • SHOC Triad (Chapeltastic) – a version of the SHOC benchmark simplified to use Chapel language features (such as promotion)

  • SHOC Sort – SHOC radix sort benchmark

  • asyncTaskComm – a synthetic benchmark to test overlap performance using multiple Chapel tasks.

Test examples

  • assertOnFailToGpuize – various examples of loops that are not eligible for GPU execution

  • mathOps – calls to various math functions within kernels that call out to the CUDA Math library

  • measureGpuCycles – measuring time within a GPU kernel

  • promotion2 – GPU kernels from promoted expressions

Examples with multiple GPUs

  • multiGPU – simple example using all GPUs within a locale

  • workSharing – stream-like example showing computation shared between GPUs and CPU

  • onAllGpusOnAllLocales – simple example using all GPUs and locales

  • copyToLocaleThenToGpu – stream-like example (with data initialized on Locale 0 then transferred to other locales and GPUs)

Setup

Requirements

First, please make sure you are using Chapel’s preferred configuration as the starting point. Specifically, the “quickstart” configuration cannot be used with GPU support.

The following are further requirements for GPU support:

  • For targeting NVIDIA or AMD GPUs, the default LLVM backend must be used as Chapel’s backend compiler (i.e. CHPL_LLVM must be set to system or bundled).

    • Note that CHPL_TARGET_COMPILER must be llvm. This is the default when CHPL_LLVM is set to system or bundled.

  • The environment variable CHPL_LOCALE_MODEL must be set to gpu.

  • Specifically for targeting NVIDIA GPUs:

    • CUDA toolkit version 11.x or 12.x must be installed.

    • We test with system LLVM 18. Older versions may work.

      • Note that LLVM versions older than 16 do not support CUDA 12.

    • If using CHPL_LLVM=system, it must have been built with support for NVPTX target. You can check supported targets of your LLVM installation by running llvm-config --targets-built.

  • Specifically for targeting AMD GPUs:

    • A ROCm version between 5.0 and 5.4, or between 6.0 and 6.2, must be installed.

    • For ROCm 5.x, CHPL_LLVM must be set to system. Note that ROCm installations come with LLVM. Setting CHPL_LLVM=system will allow you to use that LLVM.

    • For ROCm 6.x, only CHPL_LLVM=bundled is supported.

  • Specifically for using the CPU-as-Device mode:

    • CHPL_GPU=cpu must be explicitly set. In other words, Chapel will not automatically fall back to this mode simply because it can’t detect GPUs.
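Once Chapel has been built with CHPL_LOCALE_MODEL=gpu, a quick sanity check is to list the GPU sublocales that the runtime exposes. The program below is only a sketch of such a check; it relies on nothing beyond the here.gpus array used throughout this technote.

```chapel
// Report how many GPU sublocales the current locale exposes.
writeln("GPU sublocales on this locale: ", here.gpus.size);

// Print each GPU sublocale; the exact text printed depends on the
// runtime's locale naming.
for gpu in here.gpus do
  writeln(gpu);
```

In CPU-as-Device mode, the count reported here reflects the number of simulated GPU sublocales rather than physical devices.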

Features

In the following subsections we discuss various features of GPU support.

Vendor Portability

Chapel is able to generate code that will execute on either NVIDIA or AMD GPUs. Chapel’s build system will automatically try to deduce what type of GPU you have and where the relevant runtime (e.g. CUDA or ROCm) is installed. If the type of GPU is not detected, you may set the CHPL_GPU environment variable manually to either nvidia or amd. CHPL_GPU may also be set manually to cpu to use CPU-as-Device mode.

Based on the value of CHPL_GPU, Chapel’s build system will also attempt to automatically detect the path to the relevant runtime. If it is not automatically detected (or you would like to use a different installation) you may set CHPL_CUDA_PATH and/or CHPL_ROCM_PATH explicitly.

The CHPL_GPU_ARCH environment variable can be set to control the desired GPU architecture to compile for. The default value is sm_60 for CHPL_GPU=nvidia. You may also use the --gpu-arch compiler flag to set GPU architecture. If using AMD, this variable must be set. This table in the ROCm documentation has possible architecture values (see the “LLVM target name” column). For NVIDIA, see the CUDA Compute Capability table.

For NVIDIA, the CHPL_GPU_ARCH variable can also be set to a comma-separated list. This causes the Chapel compiler to generate device code for each of the given compute capabilities, and to bundle the different versions in a single executable. When the program is executed, the compute capability best suited for the available GPU will be loaded by the CUDA runtime. Support for this feature for AMD GPUs is planned, but not currently available.

CPU-as-Device Mode

The CHPL_GPU environment variable can be set to cpu to enable many GPU features to be used without requiring any GPUs and/or vendor SDKs to be installed. This mode is mainly for initial development steps or quick feature tests where access to GPUs may be limited. In this mode:

  • The compiler will generate GPU kernels from eligible loops normally.

  • It will call the internal runtime API for GPU operations, so that features outlined under Diagnostics and Utilities will work as expected.

    • For example, the @assertOnGpu attribute will still fail at compile time for ineligible loops, which allows testing whether a loop is GPU-eligible. At execution time, it will generate a warning per iteration. The CHPL_GPU_NO_CPU_MODE_WARNING environment variable can be set to suppress these warnings. Alternatively, you can pass --gpuNoCpuModeWarning to your application to the same effect.

    • Note that data movements between device and host will not be captured by the GpuDiagnostics module in this mode.

  • Even though the kernel launches will be registered by GPU diagnostics, the loop will be executed for correctness testing and there will not be any actual kernel launch even if you have a GPU available.

  • Advanced features like syncThreads and createSharedArray will compile and run, but in all likelihood code that uses those features will not generate correct results.

  • The asyncGpuComm procedure will do a blocking memcpy and gpuCommWait will return immediately.

  • There will be one GPU sublocale per locale by default. CHPL_RT_NUM_GPUS_PER_LOCALE can be set to control how many GPU sublocales will be created per locale.

  • Inner loops in loop nests that consist of GPU-eligible loops will be reported as kernel launches, whereas in regular GPU modes such loops will not be launched as kernels because execution will already be on the GPU. This may increase the number of kernel launches reported by the GpuDiagnostics utilities for loop nests and multidimensional loops.

Warning

This mode should not be used for performance studies. Application correctness is not guaranteed in complex cases.

Diagnostics and Utilities

The GpuDiagnostics module contains functions to help users count and track kernel launches and data movement between host and device(s).

To count the number of kernel launches that occur in a section of code, surround that code with calls to startGpuDiagnostics and stopGpuDiagnostics and then call getGpuDiagnostics. If called in a multi-locale environment getGpuDiagnostics will return an array of counts of launches on a per-locale basis.

To get verbose output (indicating the location of each kernel launch) surround the code with calls to startVerboseGpu and stopVerboseGpu. This output will be directed to stdout.
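A minimal sketch of both mechanisms is shown below. The array A and the loop body are illustrative. The result of getGpuDiagnostics is printed as a whole here rather than by accessing individual fields, so the sketch makes no assumption about the layout of the returned value.

```chapel
use GpuDiagnostics;

// Count kernel launches and data movement in the enclosed region.
startGpuDiagnostics();
on here.gpus[0] {
  var A: [1..100] int;
  foreach i in 1..100 do A[i] = i;
}
stopGpuDiagnostics();
writeln(getGpuDiagnostics());

// Report the location of each kernel launch in the enclosed region.
startVerboseGpu();
on here.gpus[0] {
  var A: [1..100] int;
  foreach i in 1..100 do A[i] = 2 * i;
}
stopVerboseGpu();
```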

To get a list of all GPU-eligible loops at compile time (regardless of whether they will actually run on a GPU) pass chpl the --report-gpu flag.

Since not all Chapel loops are eligible for conversion into GPU kernels, it is helpful to be able to ensure that a particular loop is being executed on the GPU. This can be achieved by marking the loop with the @assertOnGpu attribute. When a forall or foreach loop is marked with this attribute, the compiler will perform a compile-time check and produce an error if one of the aforementioned requirements is not met. Loops marked with the @assertOnGpu attribute will also conduct a runtime assertion that will halt execution when not being performed on a GPU. This can happen when the loop is eligible for GPU execution, but is being executed outside of a GPU locale. The GPU module contains additional utility functions.

In some cases, it is desirable to write code that can execute on the GPU, but is not required to do so. In this case, @assertOnGpu’s runtime component is unnecessary. The @gpu.assertEligible attribute has the same compile-time behavior as @assertOnGpu, but does not perform this execution-time check.
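The difference between the two attributes can be sketched as follows. The array A and the loop bodies are illustrative, and the sketch assumes a compiler version that accepts attributes on loops.

```chapel
on here.gpus[0] {
  var A: [1..16] real;

  // Compile-time eligibility check plus an execution-time assertion
  // that this loop really runs on a GPU.
  @assertOnGpu
  foreach i in 1..16 do A[i] = i:real;

  // Compile-time eligibility check only; this loop may also be
  // executed outside of a GPU sublocale without halting.
  @gpu.assertEligible
  foreach i in 1..16 do A[i] *= 2.0;
}
```

Moving the first loop out of the on block would be expected to trigger the execution-time halt, while the second loop would simply run on the CPU.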

Utilities in the MemDiagnostics module can be used to monitor GPU memory allocations and detect memory leaks. For example, startVerboseMem() and stopVerboseMem() can be used to enable and disable output from memory allocations and deallocations. GPU-based operations will be marked in the generated output.
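For instance, the following sketch brackets a GPU array allocation with the verbose memory calls mentioned above. The array name and size are illustrative, and the format of the generated output is not shown because it depends on the runtime.

```chapel
use MemDiagnostics;

startVerboseMem();
on here.gpus[0] {
  // This allocation, and the deallocation at the end of the block,
  // should appear in the verbose output.
  var A: [1..1024] int;
}
stopVerboseMem();
```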

Multi-Locale Support

The GPU locale model may be used alongside communication layers (values of CHPL_COMM other than none). This enables programs to use GPUs across nodes.

In this mode, normal remote access is supported outside of loops that are offloaded to the GPU; however, remote access within a kernel is not supported. An idiomatic way to use all GPUs available across locales is with nested coforall loops like the following:

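coforall loc in Locales do on loc {
  coforall gpu in here.gpus do on gpu {
    var A: [1..1024] int;      // illustrative array, allocated on this GPU
    foreach i in A.domain {    // eligible loop, launched as a kernel
      A[i] = i;
    }
  }
}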

For more examples see the tests under test/gpu/native/multiLocale available from our public GitHub repository.

Reductions and Scans

+, min and max reductions are supported via reduce expressions and intents. We are working towards expanding this to other kinds of reductions and to scan expressions, and towards deprecating the standalone GPU module functions described below.

The GPU module has standalone functions for basic reductions (e.g. gpuSumReduce) and scans (e.g. gpuScan). We expect these functions to be deprecated in favor of reduce and scan expressions in a future release.
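The supported forms can be sketched as below, using a reduce expression for + and max and a reduce intent for +. The array A and the variable names are illustrative.

```chapel
on here.gpus[0] {
  var A: [1..1000] int;
  foreach i in 1..1000 do A[i] = i;

  // Reduce expressions over an array.
  const total = + reduce A;
  const biggest = max reduce A;

  // Reduce intent on a forall loop: count the even elements.
  var evens = 0;
  forall a in A with (+ reduce evens) {
    if a % 2 == 0 then evens += 1;
  }

  writeln(total, " ", biggest, " ", evens);
}
```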

Device-to-Device Communication Support

Chapel supports direct communication between interconnected GPUs. The supported connection types are dictated by the GPU vendor.

For NVIDIA

PCIe and NVLink (on NVIDIA GPUs) are known to work.

This feature is disabled by default; it can be enabled by setting the enableGpuP2P configuration constant using the compiler flag -senableGpuP2P=true. Note that data movement does not require any code changes. The following example demonstrates using device-to-device communication to send data between two GPUs:

var dev1 = here.gpus[0],
    dev2 = here.gpus[1];
on dev1 {
  var dev1Data: [0..#1024] int;
  on dev2 {
    var dev2Data: [0..#1024] int;
    dev2Data = dev1Data;
  }
}

Notice that in this example, the GPU locales were stored into variables dev1 and dev2. Writing on here.gpus[1] in the second on statement directly would not be correct, since neither GPU locale has GPU sublocales of its own.

For AMD

The ROCm 5.x versions we support do not support enabling peer-to-peer communication in the way described above. For optimum bandwidth between two devices, export HSA_ENABLE_SDMA=0 can be used. This enables the use of multiple Infinity Fabric links between GPUs/GCDs, but it does so by using kernels to move data. These kernel launches are internal to ROCm and will not be captured by Chapel’s GPU diagnostic utilities. Their impact can nevertheless be observable when an application needs to overlap computation and communication, as what the user thinks of as “communication” will also involve kernel execution. More information about this can be found in this article.

Memory Strategies

The CHPL_GPU_MEM_STRATEGY environment variable can be used to choose between two different memory strategies. Memory strategies determine how memory is allocated when on a GPU locale.

The current default strategy is array_on_device. This strategy stores array data directly on the device and stores other data on the host in a page-locked manner. There are multiple benefits to using this strategy, including that it will result in optimal communication performance between the host and the device and may be required for Chapel to interoperate with various third-party communication libraries.

The alternative is to set the environment variable explicitly to unified_memory. The strategy applies to all dynamically-allocated data on a GPU sublocale (e.g. here.gpus[0]). Under unified memory the underlying GPU implementation implicitly manages the migration of data to and from the GPU as necessary. Note that host data can be accessed from within a GPU-eligible loop running on the device via a direct-memory transfer.

Debugger and Profiler Support for NVIDIA

cuda-gdb and NVIDIA NSight Compute can be used to debug and profile GPU kernels. We have limited experience with both of these tools. However, compiling with -g and running the application in cuda-gdb can help uncover segmentation faults coming from GPU kernels.

Similarly, NSight Compute can be used to collect detailed performance metrics from GPU kernels generated by the Chapel compiler. Compiling with -g enables Chapel line numbers to be associated with performance metrics; however, by default it also thwarts optimizations done by the backend assembler. In our experience, this can reduce execution performance significantly, making profiling less valuable. To avoid this, please use --gpu-ptxas-enforce-optimization while compiling alongside -g, and of course, --fast.

Examining Generated Assembly

While analyzing performance, users might also wish to look at the assembly chpl generates for GPU kernels. To do this pass chpl --savec <dirName> (replacing <dirName> with the name of a directory to contain the generated assembly). The Chapel compiler will emit a file chpl__gpu.s, which contains AMD GCN or NVIDIA PTX instructions as appropriate.

In the generated assembly, kernels are named chpl_gpu_kernel_<fileName>_line_<num>_, with <fileName> replaced with the name of the file containing the outlined loop and <num> with the line number of the loop header. For example, a kernel on line 3 of foo.chpl will be named chpl_gpu_kernel_foo_line_3_. The kernel name may have a number as a suffix if the same line of code required multiple kernels to be generated. Typically, this can happen if the loop in question was in a generic function with multiple instantiations.
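To make the naming scheme concrete, consider the hypothetical file foo.chpl sketched below. The file name and its contents are illustrative. Following the rule above, the kernel for the foreach loop would be expected to be named after line 4, and the generated assembly may contain other kernels as well.

```chapel
// foo.chpl (hypothetical file used only to illustrate kernel naming)
on here.gpus[0] {
  var A: [1..10] int;
  foreach i in 1..10 do A[i] = i;  // loop header is on line 4 of this file
}
// Per the naming rule, look for chpl_gpu_kernel_foo_line_4_ in chpl__gpu.s.
```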

Chapel Tasks and GPU Execution

The Chapel runtime will use a GPU stream per task, per device by default. Individual streams are synchronized with the host after each operation (e.g., whole-array operations and kernel launches will return only when the operation is completed). Even so, this allows GPUs to be oversubscribed efficiently: running multiple tasks on a GPU can improve performance because the device runtime can overlap one task’s data movement with another task’s computation.

  • This behavior is disabled for CHPL_GPU_MEM_STRATEGY=unified_memory.

  • It can also be disabled for the default CHPL_GPU_MEM_STRATEGY=array_on_device, by running the application with --gpuUseStreamPerTask=false.

See the asyncTaskComm benchmark for a full example of a pattern that benefits from oversubscribing GPUs.
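The general shape of such a pattern can be sketched as follows. This is not the asyncTaskComm benchmark itself; the names nTasks, n, HostA and DevA are illustrative. Each task copies its own host array to the GPU, runs a kernel, and copies the result back, so that operations from different tasks have a chance to overlap on the device.

```chapel
config const nTasks = 4;     // number of tasks sharing one GPU
config const n = 1_000_000;  // elements per task

coforall tid in 0..#nTasks {
  // Declared outside the `on` block, so this array lives on the host.
  var HostA: [0..#n] int = tid;

  on here.gpus[0] {
    // Host-to-device copy into an array on the GPU.
    var DevA = HostA;

    // Kernel launch issued by this task.
    foreach i in 0..#n do DevA[i] += 1;

    // Device-to-host copy of the result.
    HostA = DevA;
  }
}
```

Running the same program with --gpuUseStreamPerTask=false gives a baseline for judging whether per-task streams help in a particular case.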

Known Limitations

We are aware of the following limitations and plan to work on them among other improvements in the future.

  • Intel GPUs are not yet supported.

  • Distributed arrays cannot be used within GPU kernels.

  • PGAS style communication is not available within GPU kernels; that is: reading from or writing to a variable that is stored on a different locale from inside a GPU eligible loop (when executing on a GPU) is not supported.

  • Runtime checks such as bounds checks and nil-dereference checks are automatically disabled for CHPL_LOCALE_MODEL=gpu; i.e., --no-checks is implied when compiling.

  • The use of most extern functions within a GPU eligible loop is not supported (a limited set of functions used by Chapel’s runtime library are supported).

  • It’s not currently possible to compile for multiple AMD GPU architectures at the same time.

  • Associative arrays cannot be used on GPU sublocales with CHPL_GPU_MEM_STRATEGY=array_on_device.

  • CHPL_TASKS=fifo is not supported. Note that the fifo tasking layer is the default only on Cygwin and NetBSD.

  • The compiler assumes without complete checking that the loop indices of the loops executed on GPUs are incremented by 1.

Using C Interoperability

C interoperability on the host side is supported. However, GPU programming implies C++ linkage. To handle that, the Chapel compiler compiles the .c files passed via the command line and/or require statements with clang -x [cuda|hip]. This implies that some C features may fail to compile if they are not supported by the above clang compilation.

Performance Tips

  • If measuring performance, and using an NVIDIA GPU, please be aware that GPU initialization may incur a 1-3 second startup cost per GPU due to ECC scrubbing. This initialization occurs when starting a GPU-enabled Chapel program when NVIDIA’s kernel mode driver is not already loaded and running. If you are using Linux and not running an X server on the target GPU, then you may wish to install NVIDIA’s driver persistence daemon to alleviate this issue.

Tested Configurations

We have experience with the following hardware and software versions. The ones marked with * are covered in our nightly testing configurations.

  • NVIDIA

    • Hardware: RTX A2000, P100*, V100*, A100*, H100, GH200

    • Software: CUDA 11.3*, 11.6, 11.8*, 12.0, 12.2*, 12.4

  • AMD

    • Hardware: MI60*, MI100 and MI250X*

    • Software: ROCm 5.4*, 6.0, 6.1, 6.2*

GPU Support on Windows Subsystem for Linux

NVIDIA GPUs can be used on Windows through WSL. To enable GPU support on WSL we require the CUDA Toolkit to be installed in the WSL environment and the NVIDIA driver to be installed on the Windows host. See the NVIDIA documentation for more information on setting up CUDA on WSL. See Using Chapel on WSL for more information on using Chapel with WSL.

Note

This configuration is not currently tested nightly. Please report any issues you encounter when using Chapel on WSL by filing a bug report.

Further Information

  • The GPU Programming in Chapel series is a good resource for getting started with GPU programming in Chapel.

  • Please refer to issues with the GPU Support label for other known limitations and issues.

  • Alternatively, you can add the bug label for known bugs only.

  • Additional information about GPU Support can be found in the “GPU Support” slide decks from our release notes; however, be aware that information presented in release notes for prior releases may be out-of-date.