This article is a direct continuation of Part 3 in this series, where we discussed several aspects of Chapel’s distributed programming capabilities, and used those to create a distributed version of the code from Part 2. If you haven’t already, you may want to read the previous article before continuing.
Here, we’ll use those distributed-programming features to port one of the full Navier-Stokes simulation codes from the Barba Group’s 12 steps to Navier-Stokes tutorial to Chapel, allowing us to efficiently run simulations on any hardware ranging from a laptop to a supercomputer. And by comparing with a version written with C++, MPI, and OpenMP, we’ll see that the Chapel code, which is only slightly longer than the original Python code, is able to match the performance of the industry-standard HPC software stack.
Distributed Cavity-Flow Simulation Code
Step 11 of the tutorial discusses cavity flow. This kind of Navier-Stokes simulation shows how a fluid moves and evolves in a closed cavity (a rectangle in our case) given some boundary conditions. Each step of our simulation will compute a momentum vector at all of the finite-difference points in our 2D grid (this is a composition of the u and v arrays discussed below). It also uses a Poisson solver to compute the fluid’s pressure, p, at each finite-difference point. Both quantities will be rendered in the final solution plot.
The full Navier-Stokes code follows a very similar structure to the Poisson code from the previous articles and deviates from Step 11 of the tutorial in only a few places. As such, I’ll give a high-level summary of the code that only calls out the areas where those differences pop up, rather than going through it line-by-line.
Simulation Constants and Settings
There are now two time-step constants, nt and nit, where the Poisson solver only had one. The first corresponds to actual time, and the second is the number of iterations used to solve the Poisson equation during each time step (we also could have used the tolerance-based approach from the previous articles, but a constant number of iterations is used here for simplicity and to follow the Python tutorial more closely). This means that for every real time step, the program will run 50 Poisson iterations by default.
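As a sketch, these are just config constants that can be overridden on the command line; the exact declarations in nsStep11.chpl may differ slightly:

```chapel
// number of physical time steps, and Poisson iterations per time step
// (both can be overridden at run time, e.g. `./nsStep11 --nt=1000 --nit=100`)
config const nt = 500,
             nit = 50;
```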
There is also a new downSampleFactor constant (not included in the Python tutorial) that tells the plotting code which fraction of array elements to use in the final plot. The default of 2 means that 1/2 of the indices along each axis are used to generate the figure (i.e., 1/4 of the indices are included in the plots). For realistically large data sizes, this value should be increased to avoid generating very large plots and data files.
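To illustrate the idea, here is a small sketch that strides an index set by the factor; the actual plotting code in FlowPlot.chpl may implement the down-sampling differently:

```chapel
config const downSampleFactor = 2;

// striding a 2D index set by the down-sample factor keeps 1/4 of the
// points when the factor is 2
const full = {0..<8, 0..<8},
      sampled = full by (downSampleFactor, downSampleFactor);

writeln(sampled.size, " of ", full.size, " points would be plotted");  // 16 of 64
```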
Domain Distribution
Instead of defining the space domain over the Locales array directly, as we did in the distributed Poisson solver, we use a 2D version of Locales that has a shape of (Locales.size, 1). This indicates to the distribution that we only want to distribute blocks along the first dimension. For a small number of locales, this will give us better performance because each locale only has to communicate with two neighbors instead of four.
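Here’s a minimal sketch of that setup. The grid-size constants nx and ny, the one-element halo, and the exact form of the stencil-distribution call are assumptions on my part; see nsStep11.chpl for the real declarations:

```chapel
use StencilDist;

config const nx = 128, ny = 128;

// reshape the 1D 'Locales' array into a (numLocales x 1) grid so that
// blocks are only split along the first dimension
const localeGrid = reshape(Locales, {0..<Locales.size, 0..<1});

// distribute 'space' over the locale grid with a one-element halo ("fluff")
const space = {0..<ny, 0..<nx} dmapped new stencilDist(
                boundingBox = {0..<ny, 0..<nx},
                targetLocales = localeGrid,
                fluff = (1, 1)
              );
```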
What about for larger numbers of locales? At larger scales, distributing across both dimensions can reduce the amount of data each locale needs to exchange relative to the work it does, so a fully 2D distribution may become the better choice; conveniently, switching over only requires changing the shape of the locale grid.
Arrays & Boundary Conditions
We now have three distributed arrays: p, u, and v, defined over space, representing pressure, x-directed momentum, and y-directed momentum, respectively.
The boundary conditions are modified such that there is a rightward flow on the top of the domain (like in the tutorial), and also a leftward flow on the bottom of the domain (instead of the Dirichlet condition). This deviation from the tutorial isn’t particularly important, but results in an interesting final solution.
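Continuing the sketch above, the arrays and the modified boundary conditions might look something like the following; the axis orientation and flow speeds here are assumptions, so see nsStep11.chpl for the real code:

```chapel
// pressure and momentum arrays, distributed over 'space'
var p, u, v: [space] real;

// rightward flow along the top of the domain, leftward flow along the bottom
u[space.dim(0).high, ..] =  1.0;
u[space.dim(0).low,  ..] = -1.0;
```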
Halo Updates
As with the distributed Poisson solver, calls to updateFluff are included before each array is used in a stencil computation. This ensures that the “halo” regions in the arrays have the latest data from their neighboring locales.
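For example, before a stencil computation that reads u and v, the code does something along these lines (the real code calls updateFluff on whichever arrays the upcoming computation reads):

```chapel
// pull the latest halo values from neighboring locales before the stencil reads them
u.updateFluff();
v.updateFluff();
```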
The Full Code
The remainder of the code includes some familiar infrastructure and some new stencil computations which are pretty mathematically intensive: computeB, computeU, and computeV. This article won’t get into the details of the math; however, the original Python tutorial is an excellent resource for understanding what those computations are doing.
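To give a sense of how these pieces fit together without the math, here is a hypothetical skeleton of the time-stepping loop; the names follow those used in this article, the stencil computations are elided, and the real solveCavityFlow in nsStep11.chpl differs in its details:

```chapel
for 1..nt {                  // physical time steps
  u.updateFluff();
  v.updateFluff();
  // b = computeB(u, v);     // source term for the pressure equation (elided)

  for 1..nit {               // Poisson iterations for the pressure field
    p.updateFluff();
    // ... one pressure-update sweep over 'space' (elided) ...
  }

  // u = computeU(u, v, p);  // momentum updates (elided)
  // v = computeV(u, v, p);
  // ... re-apply the boundary conditions (elided) ...
}
```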
The full Chapel code can be viewed/downloaded here:
nsStep11.chpl
The following scripts are used for generating plots (enabled with the --createPlots=true command-line argument). Note that your Python environment must have Numpy and Matplotlib installed for plotting to work.
FlowPlot.chpl
flowPlot.py
Compiling and Running
With the above files downloaded into the same directory, the following commands will compile the program and execute it on four locales:
chpl nsStep11.chpl --fast
./nsStep11 -nl 4 --createPlots=true
A figure named Solution.png should appear in the same directory:
The opposing boundary conditions on the top and bottom of the domain create an hourglass-shaped flow through the cavity. Modifying the boundary conditions and other simulation parameters can result in very different looking solutions.
The program also prints the mean value of p and the running time for the solveCavityFlow procedure. Note that if you are emulating multiple locales on a single computer, peak performance should not be expected. To see performance advantages from distribution, the code should be run with a large problem size across multiple nodes of a cluster or supercomputer. Also note that Chapel 2.2 includes some performance optimizations [note: As of 2.2, the Chapel compiler and library work together to analyze parallel loops that access stencil-distributed arrays. If the accesses are known to be local ahead of time (i.e., they are within that locale's region of the array, or its halo buffers), then a fast path is used that does not check for locality before accessing memory. There were also improvements to the stencil distribution itself to reduce communication overhead, as well as improvements to Chapel's array slicing expressions that reduced overhead in this code's boundary-condition computation. See the 2.2 Release Announcement for more information. ] related to the Stencil Distribution, so using version 2.2 or later is recommended for running this code.
Comparison with C++/MPI/OpenMP
The C++/MPI/OpenMP counterpart [note: Caveat: I am not an expert with C++/OpenMP/MPI, so there are certainly places where the performance, and perhaps readability, of the C++ code could be improved; however, the purpose of this exercise is to show how a straightforward Chapel implementation compares to a straightforward C++/OpenMP/MPI implementation without pulling out all the stops for either technology. ] to the Chapel code can be viewed and downloaded here:
nsStep11.cpp
nsStep11.h
And the other files needed to build and run the C++ code are here:
utils.cpp
utils.h
CMakeLists.txt
As with our comparison between the non-distributed Poisson solver and its corresponding C++ code, this code is noticeably more verbose than the Chapel version. It weighs in at 280 source lines of code, whereas the Chapel version has only 120 source lines (plotting utilities included). Importantly, the Chapel version also allows for easy modification of the 2D distribution, while major updates would be required to make the arrays distributed across both dimensions in the C++ code.
A particularly striking example of the difference in succinctness is the section used for printing the mean value of the pressure array after the simulation. I include this as a sanity check and to validate the correctness of both codes against the original Python version. The following line of Chapel code:
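That line is essentially a one-line reduction and print; as a sketch (the exact message printed by nsStep11.chpl may differ):

```chapel
writeln("mean pressure: ", (+ reduce p) / p.size);
```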
corresponds to a much longer block of C++ code in nsStep11.cpp, which has to compute each rank’s partial sum over its local block, combine those partial sums across MPI processes, and then print the result only on rank 0.
Chapel is benefiting from several things here:
- the + reduce operator makes use of p’s distribution to execute an efficient distributed and parallel sum-reduction behind the scenes
- the array is a single cohesive object (rather than a collection of small arrays allocated by each MPI process), so we can do operations like + reduce on the full array
- Chapel’s global execution model (where the program starts on locale 0) removes the need for checks like if (rank == 0)
Performance Experiment Setup and Results
Code clarity aside, let’s see how Chapel stacks up against C++ in terms of performance. The following experiments were run on a Cray XC system with the default numbers of iterations (nt = 500 and nit = 50) and a variety of problem sizes. To keep the simulation numerically stable, the spatial resolution was held constant at 20 finite-difference points per unit of physical length.
system/software specs and environment…
Cray XC Supercomputer:
- 48 core Intel Xeon 8160 CPU (96 threads)
- 192 GB memory per node
- Aries network
Chapel: version 2.2
Chapel environment...
CHPL_TARGET_PLATFORM: cray-xc
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: x86-cascadelake
CHPL_LOCALE_MODEL: flat
CHPL_COMM: ugni
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-srun *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_NETWORK_ATOMICS: ugni
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: system * (version 14)
CHPL_AUX_FILESYS: none
MPI: OpenMPI version 3.1
OpenMP: version 4.5
OMP_NUM_THREADS=96
OMP_PROC_BIND=true
GCC: version 12.1
First, looking at strong-scaling, we test the running time of both programs with a range of node counts on two grid sizes: $4096 \times 4096$ and $8192 \times 8192$. Times in this plot show the average of three runs:
Strong-scaling plots like this one essentially show how much faster you can get a simulation of a particular size to run by throwing more compute resources at it. For both languages, similar speedup characteristics are achieved. For the larger problem size, Chapel comes out ahead [note: The difference in overall performance could be attributed to the difference in memory allocation strategy for the arrays across the two programs. Chapel's distributed multidimensional arrays lay out each locale's block of data in a contiguous region in memory (on that locale), whereas the strategy used to create 2D arrays in the C++ code (nested vectors) can allocate each of the inner vectors in its own region on the heap, not necessarily contiguous with the others. This can lead to poorer cache performance (i.e., more misses) when accessing the arrays. ] in terms of overall performance, while C++ is slightly ahead for most node counts with the smaller problem size.
Next, keeping the total number of finite-difference points per compute node constant, we have the following weak-scaling results:
Two problem sizes are shown: $1024^2$ elements per node and $2048^2$ elements per node. The weak-scaling analysis gives an indication of how efficiently one can take a small simulation — perhaps one that would run easily on a single compute node — and scale it up to take advantage of more compute resources. For both programming models, the smaller problem size scales quite linearly up to 16 nodes where C++ is ahead in terms of overall performance. For the larger problem size, the scaling is less consistent and performance is generally neck and neck between the two models, except at 16 nodes where Chapel comes out ahead.
Conclusion
Chapel’s powerful distributed programming features allowed us to write a concise distributed and parallel 2D Navier-Stokes cavity flow solver that runs efficiently on a supercomputer. The program was developed and tested using a small problem size on a laptop, and then scaled up to run across many nodes without making code changes. Additionally, the vast majority of the code was written without needing to consider any of the technical details of distribution or parallelization apart from the use of the stencil distribution and its update calls. Through the above performance comparison, we saw that Chapel was able to match the performance of an industry-standard software stack for doing distributed scientific simulations. Not only that, but the Chapel code was about half the length, and arguably much more readable.
The Chapel Team would like to express its sincere gratitude towards the Barba Group for creating and maintaining the CFD Python tutorials.