GPU Programming¶
Chapel includes preliminary work to target NVIDIA GPUs by generating and bundling PTX assembly and by linking against and using the CUDA driver API at runtime. This work is under active development and has not yet been tested in a wide variety of environments. We have tested it on systems with NVIDIA Tesla P100s using CUDA 11.0 and on a system with an NVIDIA Ampere A100 using CUDA 11.6. The current implementation will generate CUDA kernel code (PTX assembly) for certain forall and foreach loops, and these kernels will be launched onto a GPU when the current locale (e.g. here) is a special (sub)locale representing the GPU.
For more information about which loops are eligible for GPU execution see the Overview section. For more information about what is supported see the Requirements and Limitations section. To see an example program written in Chapel that will execute on a GPU see the code listing in the Examples section. For more information about specific features related to GPU support see the subsections under GPU Support Features. Additional information about GPU support can be found in the “Ongoing Efforts” slide decks of our release notes; however, be aware that information presented in release notes for prior releases may be out-of-date.
Overview¶
To deploy code to a GPU, put the relevant code in an on statement targeting a GPU sublocale (i.e. here.gpus[0]).
Any arrays that are declared by tasks executing on a GPU sublocale will, by default, be allocated into unified memory and be accessible on the GPU (see the Memory Strategies subsection for more information about alternate memory strategies).
Chapel will launch CUDA kernels for all eligible loops that are encountered by tasks executing on a GPU sublocale. Loops are eligible when:
They are order-independent (e.g., forall or foreach).
They only make use of known compiler primitives that are “fast” and “local”. Here “fast” means “safe to run in a signal handler” and “local” means “doesn’t cause any network communication”.
They are free of any call to a function that fails to meet the above criteria, accesses outer variables, or is recursive.
Any code in an on statement for a GPU sublocale that is not within an eligible loop will be executed on the CPU.
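For instance, here is a minimal sketch (a hypothetical program; it assumes the current locale has at least one GPU) in which the foreach loop is offloaded while the surrounding statement runs on the CPU:

on here.gpus[0] {
  var A: [1..128] int;     // allocated in unified memory by default
  foreach i in 1..128 do   // order-independent, so eligible for GPU execution
    A[i] = i * i;
  writeln(A[1..5]);        // not in an eligible loop; executes on the CPU
}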
Examples¶
The following example illustrates running a computation on a GPU as well as on a CPU. When jacobi is called with a GPU locale it will allocate the arrays A and B in the device memory of the GPU, and three GPU kernels will be generated for the forall loops in the function.
config const nSteps = 10;
config const n = 10;

writeln("on GPU:");
jacobi(here.gpus[0]);
writeln("on CPU:");
jacobi(here);

proc jacobi(loc) {
  on loc {
    var A, B: [0..n+1] real;
    A[0] = 1; A[n+1] = 1;
    forall i in 1..n { A[i] = i:real; }
    for step in 1..nSteps {
      forall i in 1..n { B[i] = 0.33333 * (A[i-1] + A[i] + A[i+1]); }
      forall i in 1..n { A[i] = 0.33333 * (B[i-1] + B[i] + B[i+1]); }
    }
    writeln(A);
  }
}
For additional examples we suggest looking at some of our internal tests. Note that these are not packaged in the Chapel release but are accessible from our public GitHub repository.
Tests of particular interest include:
Benchmark examples:¶
Jacobi – Jacobi example (shown above)
Stream – GPU enabled version of Stream benchmark
SHOC Triad (Direct) – a transliterated version of the SHOC Triad benchmark
SHOC Triad (Chapeltastic) – a version of the SHOC benchmark simplified to use Chapel language features (such as promotion)
SHOC Sort – SHOC radix sort benchmark
Test examples:¶
assertOnFailToGpuize – various examples of loops that are not eligible for GPU execution
math – calls to various math functions within kernels that call out to the CUDA Math library
measureGpuCycles – measuring time within a GPU kernel
promotion2 – GPU kernels from promoted expressions
Examples with multiple GPUs:¶
multiGPU – simple example using all GPUs within a locale
workSharing – stream-like example showing computation shared between GPUs and CPU
onAllGpusOnAllLocales – simple example using all GPUs and locales
copyToLocaleThenToGpu – stream-like example (with data initialized on Locale 0 then transferred to other locales and GPUs)
Setup and Compilation¶
To enable GPU support set the environment variable CHPL_LOCALE_MODEL=gpu before building Chapel. Chapel’s build system will automatically try to deduce where your installation of CUDA exists. If the build system fails to do this, or you would like to use a different CUDA installation, you can set the CHPL_CUDA_PATH environment variable to the CUDA installation root. We also suggest setting CHPL_RT_NUM_THREADS_PER_LOCALE=1 (this is necessary if using CUDA 10).
To compile a program simply execute chpl as normal. By default the generated code will target compute capability 6.0 (specifically by passing --cuda-gpu-arch=sm_60 when invoking clang). If you would like to target a different compute capability (necessary, for example, when targeting Tesla K20 GPUs) you can pass --gpu-arch to chpl and specify a different value there. This may also be set using the CHPL_GPU_ARCH environment variable.
If you would like to view debugging information you can pass --verbose to your generated executable. This output will show CUDA kernel launches along with various other interactions with the GPU, such as memory operations. You may also use the GPUDiagnostics module to gather similar information.
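Putting these steps together, a session might look like the following (the program name, CUDA path, and architecture value are illustrative, not required):

export CHPL_LOCALE_MODEL=gpu              # set before building Chapel itself
export CHPL_RT_NUM_THREADS_PER_LOCALE=1
export CHPL_CUDA_PATH=/usr/local/cuda     # only needed if CUDA isn't auto-detected
chpl --gpu-arch=sm_70 jacobi.chpl -o jacobi
./jacobi --verbose                        # prints kernel launches and GPU memory operations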
Requirements and Limitations¶
Because GPU support is at an early stage of development there are a number of limitations. We provide a (non-exhaustive) list of these limitations in this section; many of them will be addressed in upcoming releases.
We currently only target NVIDIA GPUs (although we are working on adding support for AMD GPUs; see the Prototypical AMD GPU Support section).
LLVM must be used as Chapel’s backend compiler (i.e. CHPL_LLVM must be set to system or bundled). For more information about these settings see Optional Settings.
If using a system install of LLVM we expect this to be the same version as the bundled version (currently 14). Older versions may work; however, we only make efforts to test GPU support with this version.
CHPL_TASKS=qthreads is required for GPU support.
PGAS-style communication is not available within GPU kernels; that is, reading from or writing to a variable that is stored on a different locale from inside a GPU-eligible loop (when executing on a GPU) is not supported.
There is no user-level feature to specify GPU block size on a per-kernel basis. This can be set on a program-wide basis at compile time by passing --gpu-block-size=size to the compiler or by setting the CHPL_GPU_BLOCK_SIZE environment variable.
There is no user-level feature to allocate or access block shared memory.
The use of most extern functions within a GPU-eligible loop is not supported (a limited set of functions used by Chapel’s runtime library are supported).
Various functions within Chapel’s standard modules call unsupported extern functions and thus are not supported in GPU-eligible loops.
Runtime checks such as bounds checks and nil-dereference checks are automatically disabled for CHPL_LOCALE_MODEL=gpu.
For loops to be considered eligible for execution on a GPU they must fulfill the requirements discussed in the Overview section.
GPU Support Features¶
In the following subsections we discuss various features or aspects of GPU support that are relatively new or otherwise noteworthy.
Diagnostics and Utilities¶
The GPUDiagnostics module contains functions to help users count and track kernel launches.
To count the number of kernel launches that occur in a section of code, surround that code with calls to startGPUDiagnostics and stopGPUDiagnostics and then call getGPUDiagnostics. If called in a multi-locale environment getGPUDiagnostics will return an array of counts of launches on a per-locale basis.
To get verbose output (indicating the location of each kernel launch) surround the code with calls to startVerboseGPU and stopVerboseGPU. This output will be directed to stdout.
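For example, the following sketch (a hypothetical program, assuming a GPU is present) counts the single kernel launch produced by the foreach loop:

use GPUDiagnostics;

startGPUDiagnostics();
on here.gpus[0] {
  var A: [1..10] int;
  foreach i in 1..10 do
    A[i] = i;                  // offloaded; expected to launch one GPU kernel
}
stopGPUDiagnostics();
writeln(getGPUDiagnostics());  // per-locale counts of kernel launches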
The GPU module contains additional utility functions. One particularly useful function is assertOnGpu(). This function conducts a runtime assertion that halts execution when not being performed on a GPU. If assertOnGpu() appears as the first line of a forall or foreach loop the Chapel compiler will also do a compile-time check and produce an error if one of the aforementioned eligibility requirements is not met. This check might also occur if assertOnGpu() is placed elsewhere in the loop, depending on the presence of control flow.
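For example, in this sketch (a hypothetical program) compilation fails if the forall loop is not eligible for offload, and execution halts if the loop somehow runs on the CPU:

use GPU;

on here.gpus[0] {
  var A: [1..100] real;
  forall i in 1..100 {
    assertOnGpu();             // first statement of the loop, so checked at compile time
    A[i] = i * 2.0;
  }
}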
Multi-Locale Support¶
As of Chapel 1.27.0 the GPU locale model may be used alongside communication layers (values of CHPL_COMM) other than none. This enables programs to use GPUs across nodes.
In this mode, normal remote access is supported outside of loops that are
offloaded to the GPU; however, remote access within a kernel is not supported.
An idiomatic way to use all GPUs available across locales is with nested coforall loops like the following:
coforall loc in Locales do on loc {
  coforall gpu in here.gpus do on gpu {
    forall i in 1..n {  // some GPU-eligible loop
      // ...
    }
  }
}
For more examples see the tests under test/gpu/native/multiLocale available from our public GitHub repository.
Memory Strategies¶
Currently, by default, Chapel uses NVIDIA’s unified memory feature to store data that is allocated on a GPU sublocale (i.e. here.gpus[0]). Under unified memory the CUDA driver implicitly manages the migration of data to and from the GPU as necessary.
We provide an alternate memory allocation strategy that stores array data directly on the device and stores other data on the host. There are multiple benefits to using this strategy: it gives users more explicit control over memory management, it may be required for Chapel to interoperate with various third-party communication libraries, and it may be necessary to achieve good performance. As such it may become our default memory strategy in the future. Be aware, though, that because this strategy is a relatively new addition it hasn’t been as thoroughly tested as our unified-memory-based approach.
To use this new strategy set the environment variable CHPL_GPU_MEM_STRATEGY to array_on_device. For more examples that work with this strategy see the tests under test/gpu/native/page-locked-mem/ available from our public GitHub repository.
Note that host data can be accessed from within a GPU-eligible loop running on the device via a direct memory transfer.
One limitation with memory access in this mode is that we do not support direct reads or writes from the host into individual elements of array data allocated on the GPU (e.g. use(A[i]) or A[i] = ...). Array data accessed “as a whole” (e.g. writeln(A)) will work, however.
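For example, under CHPL_GPU_MEM_STRATEGY=array_on_device the following sketch (a hypothetical program) illustrates the distinction:

on here.gpus[0] {
  var A: [1..10] int;        // array data stored directly on the device
  foreach i in 1..10 do
    A[i] = i;
  writeln(A);                // accessing the array "as a whole" from host code works
  // writeln(A[1]);          // a direct element read from host code is not supported
}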
Prototypical AMD GPU Support¶
We are working on adding AMD GPU support. A very early-stage prototype is currently available in the compiler. It works in a similar manner to the NVIDIA GPU implementation: the Chapel compiler generates AMD HSA binary files and bundles them into the resulting executable. Currently, there is no runtime implementation that executes the generated kernels; however, extern C code can be used to invoke the HIP API and manually launch a kernel. Furthermore, only procedures marked with pragma "codegen for GPU" are converted into kernels. See test/gpu/native/amd/extern_kernel_launch.chpl for an example of this in action.
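As a rough sketch, a procedure like the following (the name, signature, and body here are hypothetical; see the test referenced above for a complete, working example) would be compiled into a kernel that extern C code can then launch through the HIP API:

use CTypes;

pragma "codegen for GPU"
export proc addFive(dstPtr: c_ptr(real)) {
  // kernel body: runs on the GPU once launched manually via HIP
  dstPtr[0] += 5;
}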
To try the AMD GPU support prototype, the process is generally the same as that described in Setup and Compilation. Instead of configuring the path to the CUDA SDK, you will need to set CHPL_ROCM_PATH to the location of the ROCm SDK on your system. Furthermore, you will need to set the CHPL_GPU_CODEGEN environment variable to rocm. The CHPL_GPU_ARCH environment variable (or the --gpu-arch compiler flag) can be used to select the GPU architecture; the table in LLVM’s AMD documentation is useful for mapping GPUs to their architectures.