In the previous article in this series, we gave a basic introduction to scientific computing with Chapel by porting over the 4th step of the CFD Python: 12 steps to Navier-Stokes tutorial. The focus on a simple 1D example allowed us to introduce Chapel’s syntax as well as concepts like arrays, ranges, and forall loops.

In this post, we’ll use Steps 9 and 10 of the tutorial (which cover Laplace’s and Poisson’s Equations, respectively) to introduce some more advanced topics. We’ll then discuss a performance comparison with Python and C++/OpenMP to show that Chapel is able to deliver readable Python-like code while matching the performance of well-known HPC tools like OpenMP. Note that the code discussed in this article primarily follows the structure of Step 10 while also drawing some inspiration from Step 9.

In the next couple of posts, we’ll build off of the concepts introduced here to create a distributed, multi-node version of Step 11 — a full Navier-Stokes cavity-flow simulation. And to continue our performance investigation, we’ll also compare the Chapel code’s performance with an equivalent code written with C++/MPI/OpenMP.

Poisson’s Equation in Chapel

In the context of the Navier-Stokes Equations, Poisson’s equation is used to describe how a fluid’s pressure is distributed as a function of its momentum. It is also used in other scientific domains to describe things like electric potential or gravitational potential in terms of charge or mass distributions. As such, a performant and easy-to-read Poisson solver like the one we’ll discuss here can be useful in a variety of applications.

Mathematical Background

For our purposes, Poisson’s Equation can be written as follows, where $p$ is a scalar quantity representing a fluid’s pressure as a function of space, and $b$ is a scalar source term, which in the incompressible Navier-Stokes context can be derived from the fluid’s momentum:

$$ \nabla^2 p = b $$

For the 2D case, it can be written as:

$$ \dfrac{\partial^2 p}{\partial x^2} + \dfrac{\partial^2 p}{\partial y^2} = b $$

Notice that there is no time component. This means that the equation is not describing how the fluid’s pressure evolves over time. Rather, it is simply describing $p$’s shape by saying that at any point in space, the sum of its partial 2nd derivatives (curvature) in each direction should be equal to the source value at that point in space.

As a result, the finite difference method will be applied in a slightly different manner than in the previous post. Instead of using a stencil operation to compute the next time step, we’ll repeatedly apply a stencil to $p$ until we have an approximate solution that satisfies this discretization of Poisson’s Equation:

$$ \dfrac{p_{i+1,j} - 2p_{i,j} + p_{i-1,j}}{\Delta x^2} + \dfrac{p_{i,j+1} - 2p_{i,j} + p_{i,j-1}}{\Delta y^2} = b_{i,j} $$

Note that the $i$ and $j$ subscripts represent discrete $x$ and $y$ coordinates.

Because we’re not iterating for a predetermined number of discrete time steps, we’ll need a way of deciding how many times to apply the stencil operation. We can either iterate until the difference in $p$ from step to step is very small, meaning that the algorithm has asymptotically approached the solution (covered in Step 9), or we can iterate for some large number of steps, assuming that a sufficiently accurate solution will have been computed by that point (covered in Step 10). In this article, we’ll implement a code that can use either termination condition.
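For the tolerance-based condition, we'll follow Step 9 of the tutorial and track the relative change in the L1 norm of $p$ between successive iterations, stopping once it falls below a threshold:

$$ \dfrac{\sum_{i,j} |p^{k}_{i,j}| - \sum_{i,j} |p^{k-1}_{i,j}|}{\sum_{i,j} |p^{k-1}_{i,j}|} < \text{tolerance} $$

where $p^k$ denotes the value of $p$ after the $k$-th stencil application. This is the quantity the solver computes later in this article.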

Source and Boundary Conditions

From Step 10 of the tutorial, we’ll use the following value for $b$ that should create a peak and valley in our solution around the two specified points:

$$ b(x, y) = \begin{cases} +100 &\text{if } x=0.25, y=0.75 \\ -100 &\text{if } x=0.75, y=0.25 \\ 0 &\text{otherwise} \end{cases} $$

And to keep things interesting, we’ll take the boundary conditions from Step 9, which include a linear Dirichlet boundary condition on the right wall that we’ll be able to see in our solution:

$$ p(0, y) = 0, \> p(2, y) = y \quad \forall y \in [0, 2] $$ $$ \dfrac{\partial p(x, 0)}{\partial y} = 0, \> \dfrac{\partial p(x, 2)}{\partial y} = 0 \quad \forall x \in [0, 2] $$

Simulation Setup

To start things off, we declare some constants at the beginning of our program that we’ll use to implement the remainder of the simulation:

config const xLen = 2.0,
             yLen = 2.0,
             nx = 31,
             ny = 31,
             sourceMag = 100.0,
             maxIters = 10000,
             tolerance = 1e-4;

const dx = xLen / (nx-1),
      dy = yLen / (ny-1);

// simulation settings
config const makePlots = false;
config param useTolerance = true;

These mostly follow the values from the tutorial, but we’ve also included a couple of extra things:

Because the constants are declared with the config keyword, their default values can be overridden from the command line when the program is executed, without recompiling.

Additionally, a configurable param is declared to designate the simulation’s termination condition. Unlike variables and constants, params have a known value at compile time and can affect how the program is compiled. In this code, we’ll use it to switch between tolerance-based and iteration-based termination.
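For example, after compiling, the simulation could be run at a finer resolution or with a tighter tolerance without rebuilding (the values here are purely illustrative):

./nsPoisson --nx=63 --ny=63 --tolerance=1e-6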

With all that set up, we can move on to setting up our arrays.

Using 2D Arrays

In order to solve the 2D Poisson Equation in Chapel, we’ll need a way to represent the 2D quantities $p$ and $b$. Like the NumPy arrays used in the tutorial, Chapel’s arrays can have an arbitrary number of dimensions, so we can continue using them in a similar manner to the previous post.

The syntax we used last time for declaring a 1D array:

var a: [0..<nx] real;

is generalized for 2D arrays as follows:

var a: [0..<ny, 0..<nx] real;

Here, a is defined over the Cartesian product of all the indices in the two ranges.
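As a quick aside, here’s a minimal sketch (separate from the simulation code) showing how such an array is indexed and queried:

var a: [0..<3, 0..<4] real;

a[1, 2] = 5.0;       // elements are indexed with a pair of indices
writeln(a.size);     // 12 -- one element per index pair
writeln(a.domain);   // {0..2, 0..3}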

Although declaring our $p$ and $b$ arrays using the above syntax would work, it will be useful to store the set of indices that defines our problem space using a named variable. This is partially a stylistic choice, but it also makes it easier to modify the code for multi-locale execution later on (covered in the next post in this series).

We start by declaring some named index sets (or domains [note: Chapel's domains represent sets of indices. They can be used to define arrays (or in our case, groups of arrays that share the same index set) and can also be iterated over directly. See these docs for an introduction to domains. ]):

const space = {0..<ny, 0..<nx},
      spaceInner = space.expand(-1);

Here, the list of ranges within the curly braces on the first line creates a domain that will represent the index set for our simulation. The expand method is used to create another domain, spaceInner, which represents the same indices excluding those along the boundary (i.e., {1..<(ny-1), 1..<(nx-1)}). We’ll use this later for applying a stencil operation to p.
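To see what expand does in isolation, here’s a minimal sketch (not part of the solver):

const D = {0..<5, 0..<5};     // {0..4, 0..4} -- 25 indices
const DInner = D.expand(-1);  // shrink by 1 in each dimension

writeln(DInner);              // {1..3, 1..3} -- the 9 interior indices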

With our named domains defined, we can declare our 2D arrays as follows:

var p, b: [space] real;

The main procedure

Many programming languages use a main function to designate the program’s starting point. Like Python, Chapel doesn’t require this, but does support it. For this code, which is longer than the diffusion code, we’ll try to keep things organized by breaking our problem into multiple subroutines, putting the array declaration and setup into the main procedure. Without going into too much detail, this procedure makes use of the array declaration syntax we discussed above, then sets up initial and boundary conditions, and finally calls another procedure, solvePoisson, which we’ll discuss below:

proc main() {
    // define pressure and source arrays
    var p, b: [space] real = 0.0;

    // set boundary conditions
    p[..,    0] = 0;                     // p = 0 @ x = 0
    p[.., nx-1] = [j in 0..<ny] j * dy;  // p = y @ x = xLen
    p[0,    ..] = p[1, ..];              // dp/dy = 0 @ y = 0
    p[ny-1, ..] = p[ny-2, ..];           // dp/dy = 0 @ y = yLen

    // set sink/source terms
    b[3*ny/4, nx/4] = sourceMag;
    b[ny/4, 3*nx/4] = -sourceMag;

    // solve and plot
    if makePlots then plot(p, (yLen, xLen), title="initial", varName="p");

    solvePoisson(p, b);

    if makePlots then plot(p, (yLen, xLen), title="solution", varName="p");
}

The Solver and where Clauses

The algorithm for solving the Poisson Equation is essentially the same as for the Diffusion Equation; the only difference is that this code will support two different termination conditions. It will either iterate until the relative change in p drops below tolerance (with maxIters as a safety limit), or iterate for exactly maxIters iterations, depending on the value of useTolerance.

For the tolerance condition, we have the following routine, which again breaks parts of the approach into helper procedures: poissonKernel and neumannBC:

proc solvePoisson(ref p: [?d] real, const ref b: [d] real)
    where useTolerance == true
{
    // temporary copy of p
    var pn = p;

    // solve until tolerance is met or maxIters is reached
    var deltaL1Norm = 1.0, i = 0;
    while deltaL1Norm > tolerance && i < maxIters {
        p <=> pn;
        poissonKernel(p, pn, b);
        neumannBC(p);

        deltaL1Norm = (+ reduce (abs(p) - abs(pn))) / (+ reduce abs(pn));
        i += 1;
    }

    writeln("Converged in ", i, " iterations");
}

Note that we’re still preventing the simulation from running for more than maxIters iterations to avoid looping forever in case the simulation cannot converge; however, with the default simulation constants, this code will terminate well before maxIters is reached.

more on + reduce

Here, we’re using Chapel’s native reductions to compute the relative change in p’s L1 norm from the previous iteration.

Focusing on the denominator: abs(pn) applies the math function abs to pn in a promoted, element-wise manner (see more on promotion here), creating an iterable stream of values. The reduce operator then collapses those values into a single number by summing them. This is a simple and performant way to compute the sum of the absolute values of pn. A similar process occurs in the numerator, where the reduced expression is slightly more complex.

In addition to sum-reductions, Chapel supports a variety of others. Check out the docs for more information.
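As a standalone sketch of the pattern (independent of the solver), here’s a promoted reduction next to the serial loop it replaces:

var a = [1.0, -2.0, 3.0, -4.0];

const s = + reduce abs(a);  // promoted abs, then a sum-reduction: 10.0

// equivalent serial computation
var total = 0.0;
for x in a do total += abs(x);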

And for the iteration condition, we have a slightly simplified routine that uses the same helper procedures:

proc solvePoisson(ref p: [?d] real, const ref b: [d] real)
    where useTolerance == false
{
    // temporary copy of p
    var pn = p;

    // solve for maxIters iterations
    for 1..maxIters {
        p <=> pn;
        poissonKernel(p, pn, b);
        neumannBC(p);
    }
}

Notice that both procedures are named solvePoisson and accept the same arguments. Typically, defining two such procedures would be an error; however, these routines use where clauses to allow the compiler to select which of the two is called, making the double-definition legal. When the program is compiled, the compiler looks at the value of useTolerance and uses the matching overload of solvePoisson.
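To illustrate the mechanism on its own, here’s a contrived sketch (not from the solver) where a config param selects between two overloads at compile time:

config param debugMode = false;

proc debugLog(msg: string) where debugMode == true {
    writeln("debug: ", msg);
}

proc debugLog(msg: string) where debugMode == false {
    // compiles to a no-op when debugMode is false
}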

Note that useTolerance is declared as a config param, so its value can be set at compile time by passing the -s or --set flag to the compiler. For example, this command:

chpl nsPoisson.chpl --set useTolerance=false

would compile the program to use iteration-based termination.

The Poisson Kernel

The poissonKernel procedure used by both of the above routines applies the discretized Poisson Equation to p. The equation is rearranged to fit our algorithm:

$$ p_{i,j} = \dfrac{(p_{i-1,j} + p_{i+1,j}) \Delta y^2 + (p_{i,j-1} + p_{i,j+1}) \Delta x^2 - b_{i,j} \Delta x^2 \Delta y^2}{2 (\Delta x^2 + \Delta y^2)} $$

where the $p$ on the left corresponds to p, and each $p$ on the right corresponds to pn. In Chapel, we have the following:

proc poissonKernel(ref p: [] real, const ref pn: [] real, const ref b: [] real) {
    forall (i, j) in spaceInner {
        p[i, j] = (
            dy**2 * (pn[i, j+1] + pn[i, j-1]) +
            dx**2 * (pn[i+1, j] + pn[i-1, j]) -
            dx**2 * dy**2 * b[i, j]
        ) / (2.0 * (dx**2 + dy**2));
    }
}

Note that a forall loop is used, indicating to the compiler that this loop is order-independent and that spaceInner’s parallel iterator can be used to break up the work across multiple tasks. This permits the code to be executed in parallel on multi-core hardware. It’s also notable that nested loops are unnecessary even though we’re iterating over a 2D array’s indices because spaceInner’s iterator yields 2-tuples.
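For comparison, a serial version of the same iteration using nested loops over the component ranges might look like this sketch (assuming the same nx and ny):

// serial equivalent of `forall (i, j) in spaceInner`
for i in 1..<(ny-1) {
    for j in 1..<(nx-1) {
        // ... stencil body ...
    }
}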

The solvePoisson procedures also call the following boundary condition function to apply Neumann boundary conditions to the top and bottom walls. The other boundary conditions don’t have to be applied each iteration because the elements around the boundary of p are never updated after being set:

proc neumannBC(ref p: [] real) {
    p[0,    ..] = p[1, ..];    // dp/dy = 0 @ y = 0
    p[ny-1, ..] = p[ny-2, ..]; // dp/dy = 0 @ y = yLen
}
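These two statements rely on Chapel’s array slicing: an expression like p[0, ..] produces a one-dimensional view of row 0, and assigning one slice to another copies the corresponding elements. As a standalone sketch:

var m: [0..<3, 0..<3] real;

m[2, ..] = 1.0;       // scalar assignment fills the entire bottom row
m[0, ..] = m[1, ..];  // copy row 1 into row 0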

Running and generating plots

To try the simulation out on your own machine, download and place the following three files in the same directory. The first is the full code we just went over, while the second has a helper procedure for writing p to a file and calling a Python plotting script (the third file).

nsPoisson.chpl
import surfPlot.plot;

// set up simulation constants
config const xLen = 2.0,
             yLen = 2.0,
             nx = 31,
             ny = 31,
             sourceMag = 100.0,
             maxIters = 10000,
             tolerance = 1e-4;

const dx = xLen / (nx-1),
      dy = yLen / (ny-1);

// simulation settings
config const makePlots = false;
config param useTolerance = true;

// define problem space
const space = {0..<ny, 0..<nx},
      spaceInner = space.expand(-1);

proc main() {
    // define pressure and source arrays
    var p, b: [space] real = 0.0;

    // set boundary conditions
    p[..,    0] = 0;                     // p = 0 @ x = 0
    p[.., nx-1] = [j in 0..<ny] j * dy;  // p = y @ x = xLen
    p[0,    ..] = p[1, ..];              // dp/dy = 0 @ y = 0
    p[ny-1, ..] = p[ny-2, ..];           // dp/dy = 0 @ y = yLen

    // set sink/source terms
    b[3*ny/4, nx/4] = sourceMag;
    b[ny/4, 3*nx/4] = -sourceMag;

    // solve and plot
    if makePlots then plot(p, (yLen, xLen), title="initial", varName="p");

    solvePoisson(p, b);

    if makePlots then plot(p, (yLen, xLen), title="solution", varName="p");
}

proc solvePoisson(ref p: [?d] real, const ref b: [d] real)
    where useTolerance == true
{
    // temporary copy of p
    var pn = p;

    // solve until tolerance is met or maxIters is reached
    var deltaL1Norm = 1.0, i = 0;
    while deltaL1Norm > tolerance && i < maxIters {
        p <=> pn;
        poissonKernel(p, pn, b);
        neumannBC(p);

        deltaL1Norm = (+ reduce (abs(p) - abs(pn))) / (+ reduce abs(pn));
        i += 1;
    }

    writeln("Converged in ", i, " iterations");
}

proc solvePoisson(ref p: [?d] real, const ref b: [d] real)
    where useTolerance == false
{
    // temporary copy of p
    var pn = p;

    // solve for maxIters iterations
    for 1..maxIters {
        p <=> pn;
        poissonKernel(p, pn, b);
        neumannBC(p);
    }
}

proc poissonKernel(ref p: [] real, const ref pn: [] real, const ref b: [] real) {
    forall (i, j) in spaceInner {
        p[i, j] = (
            dy**2 * (pn[i, j+1] + pn[i, j-1]) +
            dx**2 * (pn[i+1, j] + pn[i-1, j]) -
            dx**2 * dy**2 * b[i, j]
        ) / (2.0 * (dx**2 + dy**2));
    }
}

proc neumannBC(ref p: [] real) {
    p[0,    ..] = p[1, ..];    // dp/dy = 0 @ y = 0
    p[ny-1, ..] = p[ny-2, ..]; // dp/dy = 0 @ y = yLen
}
surfPlot.chpl
use IO;
import Subprocess.spawn;

proc plot(
    const ref a : [?d] real,
    lens: 2*real,
    title: string,
    varName: string
) throws {
    // create a data file
    const fileName = title + ".dat";
    var fw = openWriter(fileName, locking=false);

    // print array data
    fw.write(a);
    fw.close();

    // call the python plotting script
    spawn(["python3", "surfPlot.py", fileName, title, varName, lens[0]:string, lens[1]:string]);
}
surfPlot.py
import sys
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm

filePath = sys.argv[1]
Z = np.loadtxt(filePath, delimiter=' ')

x_len = float(sys.argv[4])
y_len = float(sys.argv[5])

x = np.linspace(0.0, x_len, Z.shape[1])
y = np.linspace(0.0, y_len, Z.shape[0])
X, Y = np.meshgrid(x, y)

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, cmap=cm.viridis)

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel(sys.argv[3])
ax.set_title(sys.argv[2])

ax.view_init(elev=30, azim=230)

plt.savefig(filePath.replace('.dat', '.png'))

For the plotting to work, make sure you have python3 in your path with numpy and matplotlib available in your Python environment.
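If those packages aren’t already available, they can typically be installed with pip (assuming a standard Python setup):

python3 -m pip install numpy matplotlib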

Run the following command to compile the program:

chpl nsPoisson.chpl --fast

And try running it with:

./nsPoisson --sourceMag=500 --makePlots=true

After the simulation finishes, the plotting script should generate the following figure (named solution.png, in the same directory as the plotting script). Try adjusting the config consts (e.g., --sourceMag=-200) to see how they affect the results.

Poisson Solution

And with that, we have a full Poisson Equation solver that contains all the essential elements from Steps 9 and 10 of the Python tutorial. Not only that, it makes use of parallel iteration and occupies just under 100 lines of code. Because Chapel is a high-performance compiled language, we can also expect our code to get great performance. To put that to the test, let’s compare with a parallel implementation in another high-performance compiled language that’s often used in scientific computing.

Performance Comparison with C++/OpenMP

Porting the Chapel code to C++ [note: Caveat: I am not a C++/OpenMP expert, so there are certainly places where the performance, and perhaps readability, of the C++ code could be improved; however, the purpose of this exercise is to show how a straightforward Chapel implementation compares to a straightforward C++/OpenMP implementation without pulling out all the stops for either technology. ], using OpenMP to parallelize when possible, we have the following source and header files:

nsPoisson.cpp
#include "nsPoisson.h"

int main(int argc, const char *argv[]) {
  // define default values of constants
  unordered_map<string, variant<int, double>> default_args = {
      {"xLen", 2.0},        {"yLen", 2.0},       {"nx", 31},
      {"ny", 31},           {"tolerance", 1e-4}, {"maxIters", 10000},
      {"sourceMag", 100.0}, {"makePlots", 0}};

  // overwrite defaults with command line arguments if present
  parse_args_with_defaults(argc, argv, default_args);

  // set final values of simulation constants
  const double x_len = get<double>(default_args.at("xLen")),
               y_len = get<double>(default_args.at("yLen")),
               l1_tol = get<double>(default_args.at("tolerance")),
               source_mag = get<double>(default_args.at("sourceMag"));
  const int nx = get<int>(default_args.at("nx")),
            ny = get<int>(default_args.at("ny")),
            max_iters = get<int>(default_args.at("maxIters")),
            make_plots = get<int>(default_args.at("makePlots"));

  const double dx = x_len / (nx - 1), dy = y_len / (ny - 1);

  // define pressure and source arrays
  vector<vector<double>> p(ny, vector<double>(nx, 0.0));
  vector<vector<double>> b(ny, vector<double>(nx, 0.0));

  // set boundary conditions
  for (int j = 0; j < ny; j++) {
    p[j][0] = 0;           // p = 0 @ x = 0
    p[j][nx - 1] = j * dy; // p = y @ x = xLen
  }
  for (int i = 0; i < nx; i++) {
    p[0][i] = p[1][i];           // dp/dy = 0 @ y = 0
    p[ny - 1][i] = p[ny - 2][i]; // dp/dy = 0 @ y = yLen
  }

  // set sink/source terms
  b[3 * ny / 4][nx / 4] = source_mag;
  b[ny / 4][3 * nx / 4] = -source_mag;

  // solve and plot
  if (make_plots) {
    plot(p, x_len, y_len, "initial_cpp", "p");
  }

  solve_poisson(p, b, dx, dy, l1_tol, max_iters);

  if (make_plots) {
    plot(p, x_len, y_len, "solution_cpp", "p");
  }

  return 0;
}

void solve_poisson(vector<vector<double>> &p, vector<vector<double>> const &b,
                   const double dx, const double dy, const double tolerance,
                   const int max_iters) {
  // temporary copy of p
  auto pn = p;

#ifdef TERMONTOL
  // solve until tolerance is met or maxIters is reached
  double delta_L1_norm = 1;
  int i = 0;
  while (delta_L1_norm > tolerance && i < max_iters) {
    p.swap(pn);
    poisson_kernel(p, pn, b, dx, dy);
    neumann_bc(p);

    delta_L1_norm = delta_L1(p, pn);
    i++;
  }

  cout << "Converged in " << i << " iterations" << endl;
#else
  // solve for maxIters iterations
  for (int i = 0; i < max_iters; i++) {
    p.swap(pn);
    poisson_kernel(p, pn, b, dx, dy);
    neumann_bc(p);
  }
#endif
}

void poisson_kernel(vector<vector<double>> &p, vector<vector<double>> const &pn,
                    vector<vector<double>> const &b, const double dx,
                    const double dy) {
#pragma omp parallel for
  for (int i = 1; i < p.size() - 1; i++) {
    for (int j = 1; j < p[0].size() - 1; j++) {
      p[i][j] = (dy * dy * (pn[i][j + 1] + pn[i][j - 1]) +
                 dx * dx * (pn[i + 1][j] + pn[i - 1][j]) -
                 dx * dx * dy * dy * b[i][j]) /
                (2.0 * (dx * dx + dy * dy));
    }
  }
}

void neumann_bc(vector<vector<double>> &p) {
  const auto nx = p[0].size();
  const auto ny = p.size();

#pragma omp parallel for
  for (int i = 0; i < nx; i++) {
    p[0][i] = p[1][i];           // dp/dy = 0 @ y = 0
    p[ny - 1][i] = p[ny - 2][i]; // dp/dy = 0 @ y = yLen
  }
}

double delta_L1(vector<vector<double>> const &p,
                vector<vector<double>> const &pn) {
  double diffSum = 0;
  double nSum = 0;


#pragma omp parallel for reduction(+ : diffSum, nSum)
  for (int i = 0; i < p.size(); i++) {
    for (int j = 0; j < p[0].size(); j++) {
      diffSum += abs(p[i][j]) - abs(pn[i][j]);
      nSum += abs(pn[i][j]);
    }
  }

  return diffSum / nSum;
}
nsPoisson.h
#include "utils.h"
#include <cmath>
#include <iostream>
#include <omp.h>
#include <unordered_map>
#include <variant>
#include <vector>

using namespace std;

void solve_poisson(vector<vector<double>> &p, vector<vector<double>> const &b,
                   double dx, double dy, double tolerance, int max_iters);

void poisson_kernel(vector<vector<double>> &p, vector<vector<double>> const &pn,
                    vector<vector<double>> const &b, double dx, double dy);

void neumann_bc(vector<vector<double>> &p);
double delta_L1(const vector<vector<double>> &p,
                const vector<vector<double>> &pn);

Besides being longer, there are a few notable differences from the Chapel code:

The termination condition is chosen with a preprocessor macro (TERMONTOL) and #ifdef blocks rather than a config param with where clauses.

Command-line argument handling, which Chapel generates automatically for config declarations, has to be written by hand (see parse_args_with_defaults in utils.cpp below).

The 2D quantities are stored as nested std::vectors instead of native multidimensional arrays, and parallelism is expressed with OpenMP pragmas over explicit nested loops rather than forall loops over a domain.

Here are some other source files that will be useful for running this code yourself (note: make sure to use -DCMAKE_BUILD_TYPE=Release when invoking cmake):

utils.cpp
#include "utils.h"

void plot(vector<vector<double>> const &a, double x_len, double y_len,
          string const &title, string const &varName) {
  const auto nx = a[0].size();
  const auto ny = a.size();
  const auto file_name = title + ".dat";

  // print array data
  ofstream dataFile;
  dataFile.open(file_name);

  // write data
  dataFile << setprecision(10);
  for (int i = 0; i < ny; i++) {
    for (int j = 0; j < nx; j++) {
      if (j > 0) {
        dataFile << " ";
      }
      dataFile << a[i][j];
    }
    dataFile << "\n";
  }
  dataFile.close();

  // call the python plotting script
  stringstream plotCmd;
  plotCmd << "python3 surfPlot.py " << file_name << " " << title << " "
          << varName << " " << x_len << " " << y_len;
  const string plotCmdStr = plotCmd.str();
  const char *plotCmdCString = plotCmdStr.c_str();
  std::system(plotCmdCString);
}

void parse_args_with_defaults(
    int argc, const char *argv[],
    unordered_map<string, variant<int, double>> &defaults) {
  for (int i = 0; i < argc; i++) {
    string arg(argv[i]);
    if (arg[0] == '-' && arg[1] == '-' && arg.find('=') != string::npos) {
      size_t eq_pos = arg.find('=');
      string argName = arg.substr(2, eq_pos - 2);

      auto search = defaults.find(argName);
      if (search != defaults.end()) {
        if (holds_alternative<int>(search->second)) {
          try {
            search->second = stoi(arg.substr(eq_pos + 1, arg.size()));
          } catch (...) {
            cout << "Warning: invalid value for: '" << argName
                 << "'; keeping default value (" << get<int>(search->second)
                 << ")\n";
          }
        } else {
          try {
            search->second = stod(arg.substr(eq_pos + 1, arg.size()));
          } catch (...) {
            cout << "Warning: invalid value for: '" << argName
                 << "'; keeping default value (" << get<double>(search->second)
                 << ")\n";
          }
        }
      } else {
        cout << "Warning: argument: '" << argName
             << "' doesn't have a default value; ignoring\n";
      }
    }
  }
}
utils.h
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <unordered_map>
#include <variant>
#include <vector>

using namespace std;

void plot(vector<vector<double>> const &a, double x_len, double y_len,
          string const &title, string const &varName);

void parse_args_with_defaults(
    int argc, const char *argv[],
    unordered_map<string, variant<int, double>> &defaults);
CMakeLists.txt

# basic project setup
cmake_minimum_required(VERSION 3.1...3.24)
project(
  nsPoisson
  VERSION 1.0
  LANGUAGES CXX
)
set(CMAKE_CXX_STANDARD 17)

# define library for shared utility code
add_library(ns_utils utils.cpp utils.h)

add_executable(nsPoisson nsPoisson.cpp nsPoisson.h)

find_package(OpenMP)
if(OpenMP_CXX_FOUND)
    message(OpenMP_CXX_FOUND="${OpenMP_CXX_FOUND}")
    target_link_libraries(nsPoisson PUBLIC OpenMP::OpenMP_CXX ns_utils)
endif()

if($ENV{TERMONTOL})
    add_compile_definitions(TERMONTOL)
endif()

Experiment Setup and Results

Both the Chapel and C++ codes were compiled to run for 10,000 iterations (no tolerance termination) on various problem sizes. To keep the simulation numerically stable, the ratio of physical length per finite-difference point was kept constant at r=2.0/30. The programs were executed with the following arguments:

./nsPoisson --nx=<n> --ny=<n> --xLen=<n*r> --yLen=<n*r>

where n ranged from 64 to 2048 by powers of 2. So, the total number of points ranged from 4096 to just over four million.
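For instance, at n=512, the invocation would look roughly like this (lengths shown rounded; r = 2.0/30):

./nsPoisson --nx=512 --ny=512 --xLen=34.133 --yLen=34.133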

Testing was performed on a Linux system with the following CPU specs: 2.8GHz 32-core AMD EPYC, 256 MiB L3 cache, hyper-threading enabled. GCC 7.5 was used to compile the C++ code, which was executed with OMP_PROC_BIND=true and OMP_NUM_THREADS=32. The Chapel code was compiled with Chapel 2.0 using the default linux64 configuration. Execution times were computed as the average of three trials.

This experiment produced the following results:

Performance Comparison: Chapel & C++

For smaller problem sizes, the Chapel and C++ codes perform similarly, with C++ coming out slightly ahead. As the problem size increases, Chapel begins to outperform the C++ code [note: This could be because the C++ code uses nested vectors to represent the 2D arrays, meaning that the outer vector contains pointers to the inner vectors (which, in general, are not allocated near each other on the heap). As such, cache misses are potentially more likely as compared to the Chapel code whose 2D arrays allocate all the data in one contiguous block in memory. For larger problems, where the entire array cannot fit in cache, this effect may be harming the C++ code's performance more significantly. To alleviate this, the logical 2D arrays could instead be implemented as 1D vectors with length nx * ny. ] by a notable margin.

To put the above results in perspective, we also looked at the performance of the Python code run on the same hardware with the same problem sizes. This included a normal Python script (essentially copied from the tutorial), and a Numba-accelerated version [note: Again, I'm not an expert with Numba or other Python parallelization techniques. It's certainly possible that one could squeeze more performance out of the Python code with more effort and expertise. It is also possible to run the code on an accelerated build of NumPy that makes use of OpenMP-enabled BLAS or various vendor-accelerated backends; however, despite a significant amount of effort, I was unable to procure such a build on my testing hardware. ] of the same script which uses the @stencil and @jit(nopython=True, parallel=True) annotations to improve performance.

Performance Comparison: Chapel, C++, Python

Here, as you may expect, pure Python is slower than the compiled languages by orders of magnitude on the larger problem sizes. Numba is able to produce competitive performance up to a point, but slows down significantly for the largest problem size.

Conclusion

This article aimed to use a common scientific computation to show that Chapel’s productive syntax (including its powerful multidimensional arrays) and its excellent performance (powered by its first-class parallel computing features) make it a great candidate for use in scientific computing. Not only does it fit Steps 9 and 10 of the 12 Steps to Navier-Stokes tutorial — with helpful command-line features for configuring the simulation — into about 100 lines of code; it’s also able to outperform Numba-compiled Python and parallel C++.

In this series’ next post, we’ll dive into Chapel’s distributed programming features and see how easy it is to modify the Poisson solver to run across multiple compute nodes. After that, we’ll continue stepping up the complexity of our computations by looking at Step 11 of the tutorial — a full cavity-flow Navier-Stokes simulation. This simulation will also be able to run on a multi-node cluster or supercomputer, and to show that Chapel’s productivity doesn’t come with a performance penalty, we’ll compare its performance with a C++/MPI/OpenMP version of the same code.

The Chapel Team would like to express its sincere gratitude towards the Barba Group for creating and maintaining the CFD Python tutorials.