Measure the Performance of your Gaming GPU with Chapel

Put your gaming system to work with Chapel

Modern gaming systems have tremendous parallel potential. Today’s offerings from AMD and Intel combine with GPUs from NVIDIA to build multi-CPU systems that command thousands of GPU cores. While not supercomputers, they are definitely super computers. It would be nice if we could use Chapel to take control of all that raw power, but it is almost always locked inside the Windows operating system, making it an inhospitable development environment for HPC programs. Although Chapel runs natively on Linux and MacOS, support for Chapel on Windows was previously limited to Cygwin, degrading the performance and limiting Chapel to an incomplete set of features. The Windows Subsystem for Linux (WSL) allows Chapel to behave and perform more like it would on native Linux and is now the preferred way to develop Chapel programs on Windows.

In this post, I’m going to show that we can use WSL to run Chapel code on our NVIDIA GPUs hosted in Windows, while also getting performance that is on par with lower-level CUDA code. We’ll be evaluating performance with a commonly used memory streaming benchmark known as STREAM Triad. The STREAM Triad algorithm performs simple processing on three arrays of equal length to form a synthetic benchmark that is suitable for measuring memory bandwidth.

The pseudocode for the calculation looks something like this:

for i in 1..A.size do
  A[i] = B[i] + k * C[i]

Memory bandwidth is the measurement of how much data your graphics card can move between the GPU and the card’s vRAM in a fixed amount of time. This is an important metric for gaming, especially at higher resolutions and refresh rates that are becoming more popular; but it’s also important for image processing and machine learning applications that can have their performance limited by the interface between the GPU and vRAM. The STREAM Triad benchmark will help you understand how the actual bandwidth of your GPU compares to its maximum theoretical bandwidth.

To follow along on your own with this demonstration, you’ll need access to what I’m going to daringly call a “typical” gaming computer[note:Steam hardware surveys over the previous 18 months report a pretty consistent 90–95% market dominance by Windows 10 and 11. According to the same surveys, NVIDIA GPUs are the most popular in gaming systems, accounting for around 70–75% of systems with dedicated graphic processing units.] — that is, one that runs Windows and has a GPU from NVIDIA[note:Note that Chapel’s GPU support enables Chapel code to run on GPUs from both AMD and NVIDIA. It’s just that AMD does not support accessing your card from WSL at the time of this writing.].

To start, let’s review the setup. I’ll assume you have a Windows 10- or 11-based PC and a GPU from NVIDIA’s 10XX series or newer. You’ll also need several GB of free disk space, the amount varying depending on what you might already have installed. We should also get an idea of our card’s theoretical maximum memory bandwidth. This can often be found on third-party websites that track these sorts of stats. For my RTX 2070 Super, the memory bandwidth is reported to be ~448 GB/s.

To help navigate this guide, choose your own adventure based on your initial state:

I do not have WSL installed: continue to the next section (3–4 GB free space needed)
I have WSL, but not the CUDA Toolkit: start by installing the CUDA Toolkit (<2 GB free space needed)
I have WSL and the CUDA Toolkit, take me to the ~~final boss~~ last step in this post. (<1 GB free space needed)

Install WSL

To use WSL you must have virtual machine extensions enabled for your CPU. These are called VT-x for Intel and AMD-V for AMD systems. You’ll also need to enable the Windows feature called Virtual Machine Platform. It’s possible your system already has these settings enabled, but if not, Microsoft has a handy guide for enabling the necessary features.

You can install WSL from PowerShell or even from the Windows Store, but I will follow the PowerShell installation method in this demo.

Operations:

open up PowerShell or a command prompt
run wsl --install -d Ubuntu
reboot

At this point, you should have WSL installed. You can start WSL in a variety of ways, but for now we’ll just keep using the command prompt. Go ahead and type wsl at a PowerShell prompt to get a new Ubuntu shell.

Use CUDA from WSL

NVIDIA has published specific instructions and even made a video detailing how to utilize your video card from WSL. In order to compile CUDA code in WSL we need the CUDA compiler nvcc provided by the CUDA Toolkit; but otherwise the WSL installation can run CUDA code out of the box using the drivers already in Windows. This leads to an important detail of the next step: installing the GPU driver in WSL is not necessary! It requires some care on our part to avoid accidentally installing drivers alongside the CUDA Toolkit because doing so can cause problems when trying to run CUDA code in WSL. Thankfully, NVIDIA has made a specific set of downloads available for WSL Ubuntu installations that leave the driver installation out by default.

Before we get to installing the toolkit, we should verify that we can access the video card using the NVIDIA drivers that are already installed in Windows. If you happen to update your drivers during this process, remember to close and restart WSL before going forward with the next steps.

Test WSL access to GPU

Launch WSL
Type nvidia-smi and examine the output

(Example nvidia-smi output)

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.31.01              Driver Version: 560.81         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2070 ...    On  |   00000000:01:00.0  On |                  N/A |
| 28%   35C    P5             18W /  215W |    1165MiB /   8192MiB |     25%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

note: If you don’t get this output check that /usr/lib/wsl/lib/ exists and contains nvidia-smi. You may need to close and restart WSL.

Install the CUDA Toolkit

To compile CUDA programs, we are going to need the CUDA Toolkit installed in WSL. To avoid the additional complexity of getting a more recent version of LLVM on our Ubuntu distribution, we are going to use CUDA version 11.8, as using a newer version requires us to perform some additional steps.[note:We’d need to have LLVM > 15 available for CUDA 12. If you’re feeling up to the task, installing LLVM >= 16 will allow for CUDA 12. You can read more about the requirements from Chapel’s GPU setup notes]

There are several ways to get the CUDA Toolkit; in this demonstration, we’ll use the runfile published by NVIDIA. The installation instructions are reproduced here for convenience:

wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run

Accept the EULA
Install default options selected (only need toolkit — do not install driver!)
Ignore warning that says driver wasn’t installed, if it comes up

Follow Instructions to update local environment variables. These instructions will be given in the final output from the CUDA Toolkit install, but I am reproducing them here for completeness.

export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH

Note that if you close this Ubuntu shell you’ll need to redo these export commands. You can avoid this by adding them to your ~/.bashrc file so they will be automatically executed in every new Ubuntu shell. See the chapel-wsl-demo.txt file at the end for a list of commands used to set up this demo.

Test CUDA Toolkit installation in WSL

cuda-stream.cu

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217


/*
  STREAM benchmark implementation in CUDA.

    COPY:       a(i) = b(i)
    SCALE:      a(i) = q*b(i)
    SUM:        a(i) = b(i) + c(i)
    TRIAD:      a(i) = b(i) + q*c(i)

  It measures the memory system on the device.
  The implementation is in double precision.

  Code based on the code developed by John D. McCalpin
  http://www.cs.virginia.edu/stream/FTP/Code/stream.c

  Written by: Massimiliano Fatica, NVIDIA Corporation

  Further modifications by: Ben Cumming, CSCS; Andreas Herten (JSC/FZJ)

  Additional modifications by: Ahmad Rezaii and Brad Chamberlain to
  focus on the Triad kernel for this article
*/

#define NTIMES  10

#include <string>
#include <vector>

#include <stdio.h>
#include <float.h>
#include <limits.h>
#include <unistd.h>
#include <sys/time.h>

#include <sys/time.h>

# ifndef MIN
# define MIN(x,y) ((x)<(y)?(x):(y))
# endif
# ifndef MAX
# define MAX(x,y) ((x)>(y)?(x):(y))
# endif

typedef double real;

static double   avgtime[4] = {0}, maxtime[4] = {0},
        mintime[4] = {FLT_MAX,FLT_MAX,FLT_MAX,FLT_MAX};


void print_help()
{
    printf(
        "Usage: stream [-s] [-n <elements>] [-b <blocksize>]\n\n"
        "  -s\n"
        "        Print results in SI units (by default IEC units are used)\n\n"
        "  -n <elements>\n"
        "        Put <elements> values in the arrays\n"
        "        (defaults to 1<<26)\n\n"
        "  -b <blocksize>\n"
        "        Use <blocksize> as the number of threads in each block\n"
        "        (defaults to 192)\n"
    );
}

void parse_options(int argc, char** argv, bool& SI, int& N, int& blockSize)
{
    // Default values
    SI = false;
    N = 1<<26;
    blockSize = 192;

    int c;

    while ((c = getopt (argc, argv, "sn:b:h")) != -1)
        switch (c)
        {
            case 's':
                SI = true;
                break;
            case 'n':
                N = std::atoi(optarg);
                break;
            case 'b':
                blockSize = std::atoi(optarg);
                break;
            case 'h':
                print_help();
                std::exit(0);
                break;
            default:
                print_help();
                std::exit(1);
        }
}

/* A gettimeofday routine to give access to the wall
   clock timer on most UNIX-like systems.  */


double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}


template <typename T>
__global__ void set_array(T * __restrict__ const a, T value, int len)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < len)
        a[idx] = value;
}

template <typename T>
__global__ void STREAM_Triad(T* a, T* b, T* c, T scalar, int len)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < len)
        c[idx] = a[idx] + scalar * b[idx];
}

int main(int argc, char** argv)
{
    real *d_a, *d_b, *d_c;
    int j,k;
    double times[4][NTIMES];
    real scalar;
    std::vector<std::string> label;
    label.push_back("Copy:      ");
    label.push_back("Scale:      ");
    label.push_back("Add:      ");
    label.push_back("Triad:      ");

    // Parse arguments
    bool SI;
    int N, blockSize;
    parse_options(argc, argv, SI, N, blockSize);

    printf(" STREAM Benchmark implementation in CUDA\n");
    printf(" Array size (%s precision) =%7.2f MB\n", sizeof(double)==sizeof(real)?"double":"single", double(N)*double(sizeof(real))/1.e6);

    /* Allocate memory on device */
    cudaMalloc((void**)&d_a, sizeof(real)*N);
    cudaMalloc((void**)&d_b, sizeof(real)*N);
    cudaMalloc((void**)&d_c, sizeof(real)*N);

    /* Compute execution configuration */
    dim3 dimBlock(blockSize);
    dim3 dimGrid(N/dimBlock.x );
    if( N % dimBlock.x != 0 ) dimGrid.x+=1;

    printf(" using %d threads per block, %d blocks\n",dimBlock.x,dimGrid.x);

    if (SI)
        printf(" output in SI units (KB = 1000 B)\n");
    else
        printf(" output in IEC units (KiB = 1024 B)\n");

    /* Initialize memory on the device */
    set_array<real><<<dimGrid,dimBlock>>>(d_a, 2.f, N);
    set_array<real><<<dimGrid,dimBlock>>>(d_b, .5f, N);
    set_array<real><<<dimGrid,dimBlock>>>(d_c, .5f, N);

    /*  --- MAIN LOOP --- repeat test cases NTIMES times --- */

    scalar=3.0f;
    for (k=0; k<NTIMES; k++)
    {
        times[3][k]= mysecond();
        STREAM_Triad<real><<<dimGrid,dimBlock>>>(d_b, d_c, d_a, scalar,  N);
        cudaDeviceSynchronize();
        times[3][k]= mysecond() -  times[3][k];
    }

    /*  --- SUMMARY --- */

    for (k=1; k<NTIMES; k++) /* note -- skip first iteration */
    {
        for (j=0; j<4; j++)
        {
            avgtime[j] = avgtime[j] + times[j][k];
            mintime[j] = MIN(mintime[j], times[j][k]);
            maxtime[j] = MAX(maxtime[j], times[j][k]);
        }
    }

    double bytes[4] = {
        2 * sizeof(real) * (double)N,
        2 * sizeof(real) * (double)N,
        3 * sizeof(real) * (double)N,
        3 * sizeof(real) * (double)N
    };

    // Use right units
    const double G = SI ? 1.e9 : static_cast<double>(1<<30);

    printf("\nFunction      Rate %s  Avg time(s)  Min time(s)  Max time(s)\n",
           SI ? "(GB/s) " : "(GiB/s)" );
    printf("-----------------------------------------------------------------\n");
    for (j=3; j<4; j++) {
        avgtime[j] = avgtime[j]/(double)(NTIMES-1);

        printf("%s%11.4f     %11.8f  %11.8f  %11.8f\n", label[j].c_str(),
                bytes[j]/mintime[j] / G,
                avgtime[j],
                mintime[j],
                maxtime[j]);
    }


    /* Free memory on device */
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}

Save the code sample above as cuda-stream.cu and compile it using nvcc:

nvcc -O3 -o cuda-stream cuda-stream.cu

The -O3 flag tells the compiler to perform all the optimizations and -o just gives our output file a good name other than the default, a.out.

Now run the compiled binary:

./cuda-stream -s -b 512

Here the -s tells cuda-stream to write the output in SI units (e.g., KB as opposed to KiB), and the -b 512 sets the block size to 512 threads per block, which matches Chapel’s default.

You should see some output like this:

 STREAM Benchmark implementation in CUDA
 Array size (double precision) = 536.87 MB
 using 512 threads per block, 131072 blocks
 output in SI units (KB = 1000 B)

Function      Rate (GB/s)   Avg time(s)  Min time(s)  Max time(s)
-----------------------------------------------------------------
Triad:         399.5623      0.00418501   0.00403094   0.00468612

Note that the memory throughput reported by cuda-stream is about 400 GB/s, roughly 11% less than the theoretical maximum of 448 GB/s for the NVIDIA 2070 Super we looked up earlier.

If you were able to compile and run the sample program, then we are ready for the next (and final) steps, building Chapel with GPU support.

Build Chapel with GPU support

Prepare for Chapel

To prepare our new Ubuntu instance for Chapel, there are some packages we’ll need to install. The easiest way to do this is to grab the list of packages from the Chapel documentation. I have posted the commands for Ubuntu 22.04 from the Chapel documentation here for convenience.

This will install all the things we need to build the Chapel compiler and runtime in a later step.

sudo apt-get update
sudo apt-get install gcc g++ m4 perl python3 python3-dev bash make mawk git pkg-config cmake
sudo apt-get install llvm-dev llvm clang libclang-dev libclang-cpp-dev libedit-dev

Acquire Chapel sources

We are going to need a copy of the Chapel sources to build from. This demonstration relies on portability features that were first released with Chapel 2.0, which you can read more about in its release announcement. The sources for all Chapel releases are available from the releases page. Version 2.1 is the latest release at the time of this writing.

In your WSL shell, these commands will download and extract the Chapel source code into a new directory named ‘chapel-2.1.0’.

wget https://github.com/chapel-lang/chapel/releases/download/2.1.0/chapel-2.1.0.tar.gz
tar -xzf chapel-2.1.0.tar.gz

Go ahead and move into the chapel-2.1.0 directory now.

cd chapel-2.1.0

Next, we’ll configure and build Chapel.

Configure and build Chapel

Starting from the chapel-2.1.0 directory, we’ll set some environment variables and build Chapel.

First, source the configuration script util/setchplenv.bash:

source util/setchplenv.bash

This will set up the CHPL_HOME directory and add the output bin directory to the system PATH. This ensures the Chapel compiler chpl is available without having to know the full path when we want to use it later[note:Note that each export or source command only affects the current terminal, so as with setting the environment variables for CUDA before, you can either perform these steps each time you open a new terminal, or configure the system to do that for you by placing these lines in your .bashrc file, typically located at ~/.bashrc. See the chapel-wsl-demo.txt file at the end of this post for a list of all the commands used to set up this demo.]. Because we want to build Chapel with GPU support, we need to set a few other environment variables to configure the Chapel compiler and runtime.

export CHPL_LLVM=system
export CHPL_LOCALE_MODEL=gpu

The first line tells Chapel to use the LLVM we installed earlier as the code generation backend. LLVM is required when building Chapel with GPU support. Setting the locale model[note:The locale model is an abstraction for how a computer’s processors and memory are exposed to a Chapel program, and it is a defining feature of Chapel. Other blog posts that discuss locales are Advent of Code 2022 and Intro to GPUs.] to gpu is essentially the big switch that tells Chapel to use the GPU for eligible portions of your Chapel code. See the documentation for more information about loop structures that are eligible for GPU locales.

At this point, it might be useful to run printchplenv to see the various environment variables and values that Chapel will build and run with. The descriptions and more details about the individual variables are available in the Chapel documentation that describes setting up your environment. Most of these should not need to be adjusted from whatever the Chapel environment scripts selected for your system, but let’s look at a few of the important ones for our setup, CHPL_LLVM, CHPL_GPU, and CHPL_LOCALE_MODEL.

CHPL_LOCALE_MODEL: gpu *
  CHPL_GPU: nvidia
...
CHPL_LLVM: system *

Notice that the build scripts correctly detected that we’re using an NVIDIA GPU, and that the output indicates values we’ve explicitly set in the environment with an *.

(Full output of my printchplenv, for the curious reader)

machine info: Linux HECTOR 5.15.133.1-microsoft-standard-WSL2 #1 SMP Thu Oct 5 21:02:42 UTC 2023 x86_64
CHPL_HOME: /home/ahmad/chapel-2.1.0 *
script location: /home/ahmad/chapel-2.1.0/util/chplenv
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu *
  CHPL_GPU: nvidia
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: system *
CHPL_AUX_FILESYS: none

Now we just need to build Chapel with the make command. I recommend a parallel build using the -j option limited to the number of cores on your system. We can use the nproc utility to determine the number of cores available, and if we have sufficient memory available on the PC, we should be able to just use all the cores.

make -j`nproc`

Measure GPU memory bandwidth using Chapel

Once Chapel is built, we are ready to compile and run our example code!

The chapel-stream code is an implementation of the STREAM Triad benchmark that has been adapted from similar examples in C++ and CUDA. Recall from the introduction that it is a synthetic benchmark to measure memory bandwidth. You can read more about data movement and GPU programming with Chapel in earlier posts from the GPU Programming series that this article is a part of.

chapel-stream.chpl

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158


//
// Use standard modules for timing routines and type utility functions
//
use Time, Types;

//
// Use a common HPCC user module containing helper routines
//
use HPCCProblemSize;

//
// Whether to use SI units (true == GB, false == GiB)
//
config const SI = true;

//
// Capture an alias for the host CPU where we start execution
//
const host = here;

//
// The number of vectors and element type of those vectors
//
const numVectors = 3;
type elemType = real(64);

//
// Configuration constants to set the problem size (m) and the scalar
// multiplier (alpha)
//
config const m = 1<<26,
             alpha = 3.0;

//
// Configuration constants to set the number of trials to run and the
// amount of error to permit in the verification
//
config const numTrials = 10,
             epsilon = 0.0;

//
// Configuration constants to control what's printed -- benchmark
// parameters, input and output arrays, and/or statistics
//
config const printParams = true,
             printArrays = false,
             printStats = true;

//
// The program entry point
//
proc main() {
  printConfiguration();  // print the problem size, number of trials, etc.

  //
  // Move execution from the host CPU to the initial GPU
  //
  on here.gpus[0] {
    //
    // ProblemSpace describes the index set for the three vectors.  It
    // is a 1D domain that has indices 1 through m.
    //
    const ProblemSpace = {1..m};

    //
    // A, B, and C are the three vectors, declared to store a variable of type
    // elemType for each index in ProblemSpace. As they are local to the `on`
    // statement that executes on the GPU, the memory for these arrays will be
    // allocated in GPU-accessible memory.
    //
    var A, B, C: [ProblemSpace] elemType;

    initVectors(B, C);  // Initialize the input vectors, B and C

    var execTime: [1..numTrials] real;  // an array of timings

    for trial in 1..numTrials {  // loop over the trials
      //
      // Capture the start time
      //
      const startTime = timeSinceEpoch().totalSeconds();

      //
      // The main loop: Iterate over the vectors A, B, and C in a
      // parallel, zippered manner referring to the elements as a, b, and c.
      // Compute the multiply-add on b and c, storing the result to a.
      // This forall loop will be offloaded onto the GPU.
      //
      forall (a, b, c) in zip(A, B, C) do
        a = b + alpha * c;

      //
      // Store the elapsed time
      //
      execTime(trial) = timeSinceEpoch().totalSeconds() - startTime;
    }

    on host {
      printResults(execTime);      // ...and print the results
    }
  }
}

//
// Print the problem size and number of trials
//
proc printConfiguration() {
  if printParams {
    printProblemSize(elemType, numVectors, m);
    writeln("Number of trials = ", numTrials, "\n");
  }
}

//
// Initialize vectors B and C using arbitrary values, and
// optionally print them to the console
//
proc initVectors(ref B, ref C) {
  forall b in B do b = 0.5;
  forall c in C do c = 0.5;

  if printArrays {
    writeln("B is: ", B, "\n");
    writeln("C is: ", C, "\n");
  }
}

//
// Print out the timings and the throughput
//
proc printResults(execTimes) {
  if printStats {
    var totalTime = 0.0;
    var minTime = max(real);
    var maxTime = min(real);
    for t in execTimes[2..] { // note: skip the first iteration
      totalTime += t;
      minTime = min(minTime, t);
      maxTime = max(maxTime, t);
    }
    var avgTime = totalTime/(numTrials-1);

    writeln("Execution time:");
    writeln("  avg = ", avgTime);
    writeln("  min = ", minTime);
    writeln("  max = ", maxTime);

    if SI {
      const GBPerSec =
        numVectors * numBytes(elemType) * (m / minTime) * 1e-9;
      writeln("Performance (GB/s) = ", GBPerSec);
    } else {
      const GiBPerSec =
        numVectors * numBytes(elemType) * (m / minTime) / (1<<30):real;
      writeln("Performance (GiB/s) = ", GiBPerSec);
    }
  }
}

Now let’s compile the example program! Whenever we want to evaluate the performance of a Chapel program, it’s imperative that we compile with the --fast flag. You can read more about this flag and others in the list of most useful flags.

Run the Chapel compiler, chpl, telling it to look for additional modules in the examples directory with the use of the -M flag. This is important because our chapel-stream program makes use of a module that is not in Chapel’s standard or package libraries.

chpl chapel-stream.chpl --fast -M=$CHPL_HOME/examples/benchmarks/hpcc

(Error message and explanation when -M is not included or path is not found)

When trying to compile a Chapel program, you may encounter an error similar to the following if all the source code isn’t in the same directory:

chapel-stream.chpl:10: error: cannot find module or enum named 'HPCCProblemSize'

The error is telling us about a missing module or enum named HPCCProblemSize. If we look at the code in chapel-stream.chpl around line 9, we can see that it is bringing another module into scope with the use HPCCProblemSize; statement.

6
7
8
9


//
// Use a common HPCC user module containing helper routines
//
use HPCCProblemSize;

The problem is that we have either asked Chapel to load a user-defined module without telling it where to find it, or the path we gave was not found. To fix the error, we can set or update the environment variable, CHPL_MODULE_PATH, or pass the -M flag with the correct path to our modules when we compile our programs.

Finally, we are ready to run the chapel-stream executable using our GPU!

./chapel-stream

Problem size = 67108864 (2**26)
Bytes per array = 536870912
Total memory required (GB) = 1.5
Number of trials = 10

Execution time:
  avg = 0.00423325
  min = 0.0041151
  max = 0.00458097
Performance (GB/s) = 391.39

Success! We have executed the compiled program on our NVIDIA GPU, and we never had to use anything more than regular Chapel code to do it!

Let’s look at how the performance compares with our initial check using the CUDA implementation. The throughput for each is very similar on my machine, although the exact value can vary. Both programs consistently report ~385–400 GB/s when performing the STREAM Triad benchmark on 2^26, or about 67.1 million elements.

Note that this same Chapel code runs on AMD GPUs as written! If you have access to a machine to try it on, I encourage you to check it out. See the vendor portability section of the GPU technote for more information.

Next steps and additional exploration

At this point, we have demonstrated that we can use Chapel to write code in WSL that runs directly on your NVIDIA GPU hosted in Windows. We have also shown that Chapel’s GPU performance has the capacity to match lower-level CUDA code in the STREAM Triad benchmark.

Recall that STREAM Triad is a relatively simple algorithm we used to demonstrate Chapel’s ability to write cross-platform capable code that performs similarly to native CUDA implementations — but Chapel is capable of so much more!

Check out these other excellent blog entries to see how you can use Chapel to put your gaming system to work for you, for science, or wherever your creativity leads you! The first two articles provide more information on writing Chapel programs that target the GPU, and the third gives an example of a problem that can be adapted to improve large-scale performance by exploiting the overwhelming core counts of the GPU. Happy coding!

For more information and other examples, benchmarks and tests, see the GPU technote.

chapel-wsl-demo.txt

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36


# This is a list of the commands used in the Chapel NVIDIA WSL demo.
# These steps assume that you have already installed the NVIDIA driver in Windows
# and have enabled WSL2 with an Ubuntu distribution. These should be run in an
# Ubuntu terminal. Except for those that are explicitly stated to be added to
# the ~/.bashrc file, these commands should not need to be run again after the
# initial setup.

# Download CUDA
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run

# Install CUDA
sudo sh cuda_11.8.0_520.61.05_linux.run

# Get Chapel Dependencies
sudo apt-get update
sudo apt-get install gcc g++ m4 perl python3 python3-dev bash make mawk git pkg-config cmake
sudo apt-get install llvm-dev llvm clang libclang-dev libclang-cpp-dev libedit-dev

# Setup ENV for CUDA compilation - Add to ~/.bashrc
export PATH=/usr/local/cuda-11.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH

# Download Chapel
wget https://github.com/chapel-lang/chapel/releases/download/2.1.0/chapel-2.1.0.tar.gz
tar -xzf chapel-2.1.0.tar.gz

# Setup ENV for Chapel compilation - Add to ~/.bashrc
export CHPL_LLVM=system
export CHPL_LOCALE_MODEL=gpu

# Build Chapel
cd chapel-2.1.0
# This command adds `chpl` to the PATH and sets CHPL_HOME
source util/setchplenv.bash
make -j`nproc`
cd -

Measure the Performance of your Gaming GPU with Chapel

Share this article: