Moving data between host (CPU) and device (GPU) memories is a crucial part of GPU programming. A very common pattern is to initialize some input data on the host memory, offload the data to the device memory, crunch numbers in parallel using the GPU, and then copy the resulting data back. If you have access to multiple GPUs, potentially across multiple nodes in an HPC system, data movement can get even more complicated. How do data movement and multi-GPU parallelism work in Chapel? We will explore the answer to that question in this article.

Before starting with some code, I recommend taking a look at Daniel Fedorin’s Introduction to GPU Programming in Chapel, which covers some basics. The key points from that article that I’ll assume you’re aware of are:

- here.gpus is an array of GPU sublocales representing the GPUs on the current node
- on statements can target a GPU sublocale to execute code there
- arrays declared while executing on a GPU sublocale are allocated in device memory, and eligible loops and whole-array operations execute on the GPU as kernels

Refresher: Allocating data

Based on those key points above, let’s review how data can be allocated on device memory in Chapel in more detail here.

on here.gpus[0] {
  var DevArr: [1..5] int;  // allocates 'DevArr' on the device
  DevArr += 1;             // executes on the device as a kernel
  writeln(DevArr);         // prints "1 1 1 1 1"
}

Recall that here.gpus[0] refers to a GPU sublocale that represents the first device on the node. The body of the on statement targeting that sublocale will cause:

- variables declared in the body (like DevArr) to be allocated in device memory
- eligible operations in the body (like the whole-array DevArr += 1) to execute on the GPU as kernels

Therefore, DevArr is an array whose elements are allocated in device memory. There’s nothing special about the array’s definition. The array’s elements are allocated on the GPU because it is defined while executing on a GPU sublocale. If we look at the full code, we can see that both host- and device-allocated arrays are declared in the same way:

var HostArr: [1..5] int;   // allocates 'HostArr' on the host
HostArr = 1;               // executes on the (multicore) CPU

on here.gpus[0] {
  var DevArr: [1..5] int;  // allocates 'DevArr' on the device
  DevArr += 1;             // executes on the device as a kernel
  writeln(DevArr);         // prints "1 1 1 1 1"
}

writeln(HostArr);          // prints "1 1 1 1 1"

Rolling up our sleeves: Moving data between host and device

The example above keeps host and device values where they were declared, and it doesn’t involve any data movement between different memory types. Given that both host- and device-allocated arrays are just Chapel arrays with the same type and features, do you have a guess as to how to copy data between them? For the answer, let’s look at a slightly modified code where the host array is copied into the device array and then copied back after an increment operation:

var HostArr: [1..5] int;   // allocates 'HostArr' on the host
HostArr = 1;               // executes on the (multicore) CPU

on here.gpus[0] {
  var DevArr: [1..5] int;  // allocates 'DevArr' on the device

  DevArr = HostArr;        // copies values from host to device
  DevArr += 1;             // executes on the device as a kernel
  HostArr = DevArr;        // copies values from device to host
}

writeln(HostArr);          // prints "2 2 2 2 2"

Any surprises?

Chapel arrays can be assigned to one another using the assignment operator, =. The array implementation can figure out where each array’s data is allocated and will perform bulk copy operations under the hood. In other words, the two assignments between DevArr and HostArr above will each result in a single data movement operation between the host and the device.

Getting serious: Move parts of an array between host and device

More often than not, you’ll want to move only parts of the data to the device, operate on that part, copy the output from that part back to the host, and repeat until you finish processing your data in full. There could be multiple reasons for doing this:

- the full dataset may simply not fit in the device’s memory
- working in slices opens the door to overlapping data transfers with computation, as we’ll see later in this article
- slicing the data is also the natural starting point for spreading work across multiple GPUs and nodes

Let’s expand on our previous example and make it closer to a case where you have a very large input array:

import RangeChunk.chunks;

config const n = 32,              // the problem size (use `--n` to specify!)
             sliceSize = 4;       // the number of elements per slice

const numSlices = n/sliceSize+1;  // (+1 is to round up)

var HostArr: [1..n] int;          // allocates 'HostArr' on the host
HostArr = 1;                      // executes on the (multicore) CPU

on here.gpus[0] {
  for chunk in chunks(1..n, numSlices) {
    var DevArr: [chunk] int;      // allocates 'DevArr' on the device

    DevArr = HostArr[chunk];      // copies a slice from host to device
    DevArr += 1;                  // executes on the device as a kernel
    HostArr[chunk] = DevArr;      // copies from device to a slice on host
  }
}

writeln(HostArr);                 // prints "2 2 2 2 2 ..."

config const looks cool, what is that?

config is a unique and very powerful concept in Chapel and has nothing to do with GPU support. Variables at module scope can be declared with the config qualifier to turn them into compile- or execution-time arguments. For example, with the source above, the application can be run using ./slices --n=2_000_000 --sliceSize=1000 to set n to 2,000,000 and the slice size to 1000. You can read more about config in the Chapel Users Guide.
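
As a small illustration, here’s a minimal sketch; the names n and debug are just placeholders, and the debug param isn’t part of the example above:

config const n = 32;         // set at execution time:  ./slices --n=100
config param debug = false;  // set at compile time:    chpl -sdebug=true slices.chpl

writeln("n = ", n, ", debug = ", debug);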

There are two key changes from the previous example. First, we use the RangeChunk module to slice the range 1..n up into numSlices chunks, which are also of type range. The chunks iterator makes sure that the generated chunks cover the whole range, even if n is not divisible by numSlices.
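
If you haven’t used the chunks iterator before, here is a minimal sketch of what it yields (the exact chunk boundaries depend on the iterator’s splitting policy, but together the chunks always cover the full range):

import RangeChunk.chunks;

for c in chunks(1..10, 3) do
  writeln(c);   // prints three ranges covering 1..10, e.g. "1..4", "5..7", "8..10"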

More importantly, the loop body now copies array slices back and forth. Chapel arrays can be indexed with the [] syntax, and that same syntax can also be used to slice arrays: if the argument is a range, an array slice is created. An array slice does not have its own data; it refers to the same data as the original array. You may have guessed it already, but an array slice can be used pretty much anywhere an array can be used. If slices appear in assignments like the ones in the loop body above, the corresponding data will be copied in bulk. The fact that one or both sides of an array assignment is a slice whose data is stored on a GPU does not matter; Chapel will copy the relevant data between the host and the device under the hood.
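
As a small host-only sketch of these slicing semantics (the array names are just for illustration):

var A: [1..10] int = 1;
var B: [1..4] int;

B = A[3..6];    // bulk-copies elements 3..6 of A into B
A[1..4] = 5;    // writes through a slice: A's own elements 1..4 change
writeln(A);     // prints "5 5 5 5 1 1 1 1 1 1"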

Getting efficient: Overlapping data transfer with computation

Currently, the GPU sublocale copies a slice of HostArr into the device memory first. Once the copy is complete, the kernel executes. And then, once the kernel finishes, the data is copied back. Only then does the program move on to copying the next slice. This sequential execution of copy-execute-copy leaves some performance on the table. GPUs are capable of transferring data while executing a kernel. Some higher-end GPUs can even handle multiple data copies in each direction (host-to-device, device-to-host) while executing a kernel.

Overlapping data movement and processing is a common optimization. In a typical overlapping scenario, a single device is used by multiple parallel execution units. Some of these parallel units can perform data transfers while another executes a kernel. Let’s take a look at how we can achieve this with Chapel’s features for parallelism:

import RangeChunk.chunks;

config const n = 32,              // the problem size (use `--n` to specify!)
             sliceSize = 4;       // the number of elements per slice

const numSlices = n/sliceSize+1;  // (+1 is to round up)

var HostArr: [1..n] int;          // allocates 'HostArr' on the host
HostArr = 1;                      // executes on the (multicore) CPU

on here.gpus[0] {
  coforall chunk in chunks(1..n, numSlices) {
    var DevArr: [chunk] int;      // allocates 'DevArr' on the device *per task*

    DevArr = HostArr[chunk];      // copies a slice from host to device
    DevArr += 1;                  // executes on the device as a kernel
    HostArr[chunk] = DevArr;      // copies from device to a slice on host
  }
}

writeln(HostArr);                 // prints "2 2 2 2 2 ..."

The key difference from the previous example is that we now use a coforall loop instead of a for loop. A coforall loop creates a parallel task per iteration, and these tasks can run concurrently.
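
If coforall is new to you, here is a tiny CPU-only sketch of its semantics:

coforall tid in 1..4 {
  writeln("hello from task ", tid);  // four tasks run concurrently;
                                     // the output order may vary between runs
}
writeln("all tasks are done");       // coforall waits for all of its tasks to finish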

Chapel tasks running on the same GPU sublocale can execute GPU operations, including transfers and kernel launches, independently of one another. Depending on the capabilities of your particular hardware and available resources, the underlying GPU can overlap data transfer and execution from distinct tasks, leading to improved performance.

Admittedly, overlap is an optimization that can be hard to fine-tune; the balance between transfer performance (bandwidth and latency) and computational intensity must be handled delicately, and the tuning is typically specific to the hardware and application characteristics. For example, in our very simple case of launching a kernel to increment each element of the array by one, I don’t expect any gain from overlap: this application has almost no computation, and its execution time will be dominated by data transfer, leaving little opportunity to overlap. However, for more realistic and computationally intensive applications, the benefits can be significant. We plan to revisit overlap with a more realistic example in a future article.

Tell me more about how this works before we move on

GPUs can perform transfer and execution asynchronously from the CPU. This is achieved by a concept called a stream or work queue. The CPU can create multiple streams per GPU, and queue up operations in those streams. The GPU driver guarantees that the order of operations within a stream will be preserved. However, it can schedule operations from different streams in any order it sees fit. Typically, this is driven by the availability of data transfer and/or execution units at the time of scheduling.

Chapel tasks create and use their own streams when interacting with GPUs. This enables operations coming from distinct tasks to be scheduled concurrently for potential overlap.

[Figure: Using per-task streams can enable overlap]

The figure above shows the difference between using the default stream and per-task GPU streams in the context of two tasks and a toy workload, where the scheduling permitted by the per-task streams results in faster execution.

Getting parallel: Use multiple GPUs concurrently

One of the cases where you might want to copy slices of arrays back and forth is when you have multiple GPUs in your workstation/server or you have access to a cluster or a supercomputer with multiple GPUs per node. Luckily, we have already seen a way of executing a number of parallel tasks concurrently. Yes, I am talking about coforall loops:

import RangeChunk.chunks;

config const n = 32,              // the problem size (use `--n` to specify!)
             sliceSize = 4;       // the number of elements per slice

var HostArr: [1..n] int;          // allocates 'HostArr' on the host
HostArr = 1;                      // executes on the (multicore) CPU

const numGpus = here.gpus.size;   // number of GPUs on the locale
coforall (gpu, gpuChunk) in zip(here.gpus, chunks(1..n, numGpus)) {
  on gpu {
    const numSlices = gpuChunk.size/sliceSize+1;  // (+1 is to round up)

    coforall chunk in chunks(gpuChunk, numSlices) {
      var DevArr: [chunk] int;    // allocates 'DevArr' on the device

      DevArr = HostArr[chunk];    // copies a slice from host to device
      DevArr += 1;                // executes on the device as a kernel
      HostArr[chunk] = DevArr;    // copies from device to a slice on host
    }
  }
}

writeln(HostArr);  // prints "2 2 2 2 2 ..."

In this code, we introduce a second coforall loop, which we use to compute with multiple GPUs simultaneously. This loop uses zip to iterate over the GPUs in the current node and the chunks generated by the chunks iterator in lock-step.
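
If zippered iteration is new to you, here is a tiny sketch that has nothing to do with GPUs:

// zip pairs up the i'th element of each iterand on the i'th iteration
for (i, s) in zip(1..3, ["a", "b", "c"]) do
  writeln(i, " ", s);   // prints "1 a", "2 b", "3 c"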

In the end, gpu and gpuChunk represent a GPU sublocale and the chunk that it needs to work on for each task created by the coforall loop. Then, an on statement moves each task onto its corresponding GPU sublocale.

Using a coforall loop together with an on statement to use multiple GPU sublocales concurrently is nothing new in Chapel. In fact, coforall loops are commonly used with on statements to parallelize work across multiple compute nodes. In the next step, we will re-use the idiom introduced here to expand the execution to multiple nodes.

Getting distributed: Use multiple nodes with multiple GPUs concurrently

We can expand our previous example to use multiple nodes by adding another similar coforall loop and an on statement:

import RangeChunk.chunks;

config const n = 32;                // the problem size (use `--n` to specify!)
config const sliceSize = 4;         // the number of elements per slice

var HostArr: [1..n] int;            // allocates on the host
HostArr = 1;                        // executes on the (multicore) CPU

coforall (loc, locChunk) in zip(Locales, chunks(1..n, numLocales)) {
  on loc {
    const numGpus = here.gpus.size;
    coforall (gpu, gpuChunk) in zip(here.gpus, chunks(locChunk, numGpus)) {
      on gpu {
        const numSlices = gpuChunk.size/sliceSize+1;  // (+1 is to round up)

        coforall chunk in chunks(gpuChunk, numSlices) {
          var DevArr: [chunk] int;  // allocates on the device

          DevArr = HostArr[chunk];  // copies a slice from host to device
          DevArr += 1;              // executes on the device as a kernel
          HostArr[chunk] = DevArr;  // copies from device to a slice on host
        }
      }
    }
  }
}

writeln(HostArr);  // prints "2 2 2 2 2 ..."

The outermost coforall and on statement should be familiar. Instead of iterating over here.gpus as the inner coforall does, the new outermost loop iterates over Locales. Locales is a built-in array that represents all compute nodes the application is using. Similarly, numLocales is another built-in that’s just a slightly more convenient way of writing Locales.size. Finally, the outermost on statement targets a locale that represents a compute node.
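
As a quick way to see what these built-ins describe on a given system, consider this small sketch:

writeln("running on ", numLocales, " locale(s)");
for loc in Locales do
  writeln(loc.name, " has ", loc.gpus.size, " GPU sublocale(s)");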

Overall, this snippet uses the same idiom to distribute work across compute nodes and across GPUs within them.

Summary

We’ve covered a lot of ground in this post:

- host- and device-allocated arrays are declared the same way; where the code executes determines where the data lives
- array assignment moves data between host and device memory in bulk
- array slices make it easy to move just parts of an array
- coforall tasks use their own GPU streams, enabling transfers to overlap with kernel execution
- the familiar coforall-plus-on idiom scales the same code to multiple GPUs and multiple nodes

If you were already familiar with Chapel, the key takeaway is that there aren’t a lot of new concepts to learn that enable GPU programming — the features that you know about can readily enable GPU programming. If you are new to Chapel but know about GPU programming, the key takeaway is that Chapel makes programming GPUs feel as natural as programming CPUs. If you are new to both Chapel and GPU programming, the key takeaway is that GPU programming doesn’t have to be scary!

So far, we have only covered how to program GPUs in Chapel. None of it matters unless it performs and scales well. Stay tuned for our upcoming GPU blog posts about performance and scaling!