Memory Consistency Model

In this section, we describe Chapel’s memory consistency model. The model is based on sequential consistency for data-race-free programs as adopted by C11, C++11, Java, UPC, and Fortran 2008.

Sequential consistency (SC) means that all Chapel tasks agree on the interleaving of memory operations, and that this interleaving is consistent with the order of operations in the program source code. Two memory operations conflict if they access the same variable (that is, the same memory location) and at least one of them is a write. Conflicting operations form a data race if they come from different Chapel tasks and can be executed concurrently. Accesses to the same variable from different tasks can result from the tasks referencing the same variable directly, or indirectly via aliases. Aliases arise, for example, when using ref variables, argument intents, return intents, task intents, and forall intents.

Any Chapel program with a data race is not a valid program, and an implementation cannot be relied upon to produce consistent behavior. Valid Chapel programs will use synchronization constructs such as sync or atomic variables or higher-level constructs based on these to enforce ordering for conflicting memory operations.
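As a concrete illustration (a minimal sketch, not taken from the surrounding text), the following shows a data race and one way to eliminate it with an atomic variable:

```chapel
// INVALID: multiple tasks perform unsynchronized read-modify-writes.
var counter = 0;
coforall i in 1..4 with (ref counter) do
  counter += 1;                // data race: conflicting concurrent accesses

// VALID: the same pattern with an atomic variable has no data race.
var safeCounter: atomic int;
coforall i in 1..4 do
  safeCounter.add(1);          // atomic read-modify-write
writeln(safeCounter.read());   // prints 4
```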

The following design principles were used in developing Chapel’s memory consistency model:

  1. Sequential programs have program order semantics. Programs that are completely sequential cannot have data races and should appear to execute as though each statement was executed one at a time and in the expected order.

  2. Chapel’s fork-join constructs introduce additional order dependencies. Operations within a task cannot behave as though they started before the task started. Similarly, all operations in a task must appear to be completed to a parent task when the parent task joins with that task.

  3. Multi-locale programs have the same memory consistency model as single-locale programs. The Chapel language seeks to allow a single description of an algorithm to work with different data distributions. A result of this property is that an expression of a program must be correct whether it is working on local or distributed data.

  4. Chapel’s memory model should be as relaxed as possible while still consistent with these design principles. In particular, making all operations sequentially consistent is not likely to enable good performance. At the same time, sequential consistency should be available to programmers when requested.
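Principle 2 can be illustrated with a small sketch: a write performed inside a child task is guaranteed to be visible to the parent once the parent joins with that task.

```chapel
var data = 0;
sync {
  begin with (ref data) {
    data = 42;        // performed inside the child task
  }
}                     // join: all of the child's operations appear complete
writeln(data);        // prints 42
```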

See A Primer on Memory Consistency and Cache Coherence by Sorin, et al. for more background information on memory consistency models. This chapter will proceed in a manner inspired by the \(XC\) memory model described there.

Sequential Consistency for Data-Race-Free Programs

Sequential consistency for data-race-free programs is described in terms of two orders: program order and memory order. The program order \(<_p\) is a partial order describing serial or fork-join parallelism dependencies between variable reads and writes. The memory order \(<_m\) is a total order that describes the semantics of synchronizing memory operations (via atomic or sync variables) with sequential consistency. Non-SC atomic operations (described in Non-Sequentially Consistent Atomic Operations) do not create this total order.

Note that sync variables have memory consistency behavior equivalent to a sequence of SC operations on atomic variables. Thus for the remainder of the chapter, we will primarily focus on operations on atomic variables.
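For example, signaling between two tasks can be written with a sync variable or with default (SC) operations on an atomic variable; from the standpoint of memory consistency the two sketches below behave alike:

```chapel
// Signaling with a sync variable...
var flagS: sync bool;
begin flagS.writeEF(true);   // fills the sync variable
flagS.readFE();              // blocks until full, then empties it

// ...has memory consistency behavior like SC atomic operations:
var flagA: atomic bool;
begin flagA.write(true);     // SC store (the default ordering)
flagA.waitFor(true);         // blocks until an SC read observes true
```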

We will use the following notation:

  • \(L(a)\) indicates a load from a variable at address \(a\). \(a\) could refer to local or remote memory.

  • \(S(a)\) indicates a store to a variable at address \(a\). \(a\) could refer to local or remote memory.

  • \(A_{sc}(a)\) indicates an atomic operation on a variable at address \(a\) with sequential consistency. The variable at address \(a\) could refer to local or remote memory. Atomic operations must be completed as a single operation (i.e. atomically), and so it is not possible to observe an intermediate state from an atomic operation under any circumstances.

  • \(A_r(a,o)\) indicates an atomic operation on a variable at address \(a\) with ordering constraint \(o\), where \(o\) can be one of relaxed, acquire, or release (see Non-Sequentially Consistent Atomic Operations). As with \(A_{sc}(a)\), relaxed atomic operations must be completed as a single operation.

  • \(L(a)\), \(S(a)\), \(A_{sc}(a)\), and \(A_r(a,o)\) are also called memory operations.

  • \(X <_p Y\) indicates that \(X\) precedes \(Y\) in program order.

  • \(X <_m Y\) indicates that \(X\) precedes \(Y\) in memory order.

  • t = begin{X} starts a new task named \(t\) to execute \(X\).

  • waitFor(t_1..t_n) waits for tasks \(t_1..t_n\) to complete.

  • on(L) migrates the running task to locale \(L\). Note that while the on statement may change the locale on which the current task is running, it has no impact on the memory consistency requirements.

For the purposes of describing this memory model, it is assumed that Chapel programs will be translated into sequences of memory operations, begin statements, and waitFor statements. The translation of a Chapel program into a sequence of memory operations must preserve sequential program semantics. That is, if we have a snippet of a Chapel program without task operations, such as X; Y;, the statements \(X\) and \(Y\) will be converted into a sequence of load, store, and atomic operations in a manner that preserves the behavior of this serial portion of the program. That is, \(X=x_1,x_2,...\) and \(Y=y_1,y_2,...\) where \(x_i\) and \(y_j\) are each a sequence of load, store, or atomic operations and we have \(x_i <_p y_j\).

Likewise, for the purposes of this memory model, Chapel’s parallelism keywords are viewed as a sequence of operations including the primitives of starting a task (begin) and waiting for some number of tasks (waitFor(t_1..t_n)). In particular:

  • forall (including promotion) creates some number of tasks \(m\) to execute the \(n\) iterations of the loop (t_i = begin{some-loop-bodies} for each task \(i=1..m\)) and waits for them to complete (waitFor(t_1..t_m)). The number of tasks \(m\) is defined by the implementation of the parallel iterator (see Iterators for details on iterators).

  • coforall creates one task per loop iteration (t_i = begin{loop-body} for all loop iterations \(i=1..n\)) and then waits for them all to complete (waitFor(t_1..t_n)).

  • cobegin creates one task per enclosed statement (t_i = begin{X_i} for statements \(X_1\)..\(X_n\)) and then waits for them all to complete (waitFor(t_1..t_n)).

  • begin creates a task to execute the enclosed statement (t = begin{X}). The sync statement waits for all tasks \(t_i\) created by a begin statement in the dynamic scope of the sync statement that are not within other, nested sync statements (waitFor(t_1..t_n) for all \(n\) such tasks).
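For instance, a coforall loop can be modeled as one begin per iteration wrapped in a sync statement that performs the join. A sketch, where the hypothetical work procedure stands in for the loop body:

```chapel
proc work(i: int) { writeln("iteration ", i); }

// This loop...
coforall i in 1..3 do
  work(i);

// ...is modeled as t_i = begin{work(i)} plus waitFor(t_1..t_3):
sync {
  for i in 1..3 do
    begin work(i);
}  // the sync statement joins with all three tasks
```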

Program Order

Task creation and task waiting create a conceptual tree of program statements. The task bodies, task creation, and task wait operations create a partial order \(<_p\) of program statements. For the purposes of this section, the statements in the body of each Chapel task will be implemented in terms of load, store, and atomic operations.

  • If we have a program snippet without tasks, such as X; Y;, where \(X\) and \(Y\) are memory operations, then \(X <_p Y\).

  • The program X; begin{Y}; Z; implies \(X\) \(<_p\) \(Y\). However, there is no particular relationship between \(Y\) and \(Z\) in program order.

  • The program t = begin{Y}; waitFor(t); Z; implies \(Y\) \(<_p\) \(Z\).

  • \(X\) \(<_p\) \(Y\) and \(Y\) \(<_p\) \(Z\) imply \(X\) \(<_p\) \(Z\).

Memory Order

The memory order \(<_m\) of SC atomic operations in a given task respects program order as follows:

  • If \(A_{sc}(a)<_pA_{sc}(b)\) then \(A_{sc}(a)<_mA_{sc}(b)\)

Every SC atomic operation gets its value from the last SC atomic operation before it to the same address in the total order \(<_m\):

  • Value of \(A_{sc}(a)\) = Value of \(max_{<_m} ( A_{sc}'(a)|A_{sc}'(a) <_m A_{sc}(a) )\)

For data-race-free programs, every load gets its value from the last store before it to the same address in the total order \(<_m\):

  • Value of \(L(a)\) = Value of \(max_{<_m} ( S(a) | S(a) <_m L(a) \textrm{ or } S(a) <_p L(a) )\)

For data-race-free programs, loads and stores are ordered with SC atomics. That is, loads and stores for a given task are in total order \(<_m\) respecting the following rules which preserve the order of loads and stores relative to SC atomic operations:

  • If \(L(a)<_pA_{sc}(b)\) then \(L(a)<_mA_{sc}(b)\)

  • If \(S(a)<_pA_{sc}(b)\) then \(S(a)<_mA_{sc}(b)\)

  • If \(A_{sc}(a)<_pL(b)\) then \(A_{sc}(a)<_mL(b)\)

  • If \(A_{sc}(a)<_pS(b)\) then \(A_{sc}(a)<_mS(b)\)

For data-race-free programs, loads and stores preserve sequential program behavior. That is, loads and stores to the same address in a given task are in the order \(<_m\) respecting the following rules which preserve sequential program behavior:

  • If \(L(a) <_p L'(a)\) then \(L(a) <_m L'(a)\)

  • If \(L(a) <_p S(a)\) then \(L(a) <_m S(a)\)

  • If \(S(a) <_p S'(a)\) then \(S(a) <_m S'(a)\)
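These rules are what make the classic message-passing idiom work: plain loads and stores on either side of an SC atomic operation stay ordered with respect to it. A sketch:

```chapel
var payload = 0;
var ready: atomic bool;

cobegin with (ref payload) {
  { // producer: S(payload) <p A_sc(ready) implies S(payload) <m A_sc(ready)
    payload = 42;
    ready.write(true);       // SC store publishes the payload
  }
  { // consumer: A_sc(ready) <p L(payload) implies A_sc(ready) <m L(payload)
    ready.waitFor(true);     // SC reads until the flag is observed true
    writeln(payload);        // prints 42; no data race on payload
  }
}
```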

Non-Sequentially Consistent Atomic Operations

Sequential consistency for atomic operations can be a performance bottleneck under some circumstances. Chapel provides non-SC atomic operations to help alleviate such situations. Such uses of atomic operations must be done with care and should generally not be used to synchronize tasks.

Non-SC atomic operations are specified by providing a memory order argument to the atomic operations. See the Atomic Variables section for more information on the memory order types.

Open issue.

This section describes memoryOrder.relaxed but does not yet describe memoryOrder.acquire, memoryOrder.release, or memoryOrder.acqRel orderings. The intention is that the behavior of these orderings match the C and C++ definitions.

Relaxed Atomic Operations

Although Chapel’s relaxed atomic operations (memoryOrder.relaxed) do not complete in a total order by themselves and might contribute to non-deterministic programs, relaxed atomic operations cannot contribute to a data race that would prevent sequential consistency.

When relaxed atomics are used only for atomicity and not as part of synchronizing tasks, their effect can be understood in the memory consistency model described in Sequential Consistency for Data-Race-Free Programs by treating them as ordinary loads or stores with two exceptions:

  • Atomic operations (including relaxed atomic operations) cannot create data races.

  • All atomic operations (including relaxed atomic operations) will eventually be visible to all other threads. This property is not true for normal loads and stores.
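A typical use is a counter that needs atomicity but no ordering, because its final value is only read after a fork-join point has already ordered everything. A sketch:

```chapel
var hits: atomic int;
coforall i in 1..100 do
  hits.add(1, memoryOrder.relaxed);  // atomic, but no SC ordering
// The coforall join orders the adds before this read:
writeln(hits.read());                // prints 100
```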

Unordered Memory Operations

Open issue.

Syntax for unordered operations has not yet been finalized.

Open issue.

Should Chapel provide a memory fence that only completes unordered operations started by the current task?

Open issue.

Should unordered operations on a particular memory address always wait for the address to be computed?

Open issue.

Does starting a task or joining with a task necessarily wait for unordered operations to complete?

Rather than issuing normal loads and stores to read or write local or remote memory, a Chapel program can use unordered loads and stores when preserving sequential program behavior is not important. The following notation for unordered memory operations will be used in this section:

  • \(UL(a)\) indicates an unordered load from a variable at address \(a\). \(a\) could point to local or remote memory.

  • \(US(a)\) indicates an unordered store to a variable at address \(a\). Again, \(a\) could point to local or remote memory.

The unordered loads and stores \(UL(a)\) and \(US(a)\) respect fences but not program order. As in Section Memory Order, unordered loads and stores are ordered with SC atomics. That is, unordered loads and stores for a given task are in total order \(<_m\) respecting the following rules which preserve the order of unordered loads and stores relative to SC atomic operations:

  • If \(UL(a)<_pA_{sc}(b)\) then \(UL(a)<_mA_{sc}(b)\)

  • If \(US(a)<_pA_{sc}(b)\) then \(US(a)<_mA_{sc}(b)\)

  • If \(A_{sc}(a)<_pUL(b)\) then \(A_{sc}(a)<_mUL(b)\)

  • If \(A_{sc}(a)<_pUS(b)\) then \(A_{sc}(a)<_mUS(b)\)

Unordered loads and stores do not preserve sequential program behavior.

Unordered Memory Operations Examples

Unordered operations should be thought of as happening in a way that overlaps with the program task. Unordered operations started in different program statements can happen in any order unless an SC atomic operation orders them.

Since unordered operations started by a single task can complete in any order and can overlap the task's other operations, even a totally sequential program can have a data race when it uses unordered operations.

var x: int = 0;
unordered_store(x, 10);
unordered_store(x, 20);
writeln(x);

The value of x at the end of this program could be 0, 10, or 20. As a result, programs using unordered loads and stores are not sequentially consistent unless the program can guarantee that unordered operations can never operate on the same memory at the same time when one of them is a store. In particular, the following are safe:

  • Unordered stores to disjoint regions of memory.

  • Unordered loads from potentially overlapping regions of memory when no store could overlap with the loads.

  • Unordered loads and stores to the same memory location when these are always separated by an SC atomic operation.

Unordered loads and stores are available as a performance optimization. For example, a program computing a permutation on an array might want to move data between two arrays without requiring any ordering:

const n = 10;
// P is a permutation on 1..n, in this case reversing its input
var P = for i in 1..n by -1 do i;
// A is an array to permute
var A = for i in 1..n do i;
// Compute, in B, the permutation applied to A
var B:[1..n] int;

for i in 1..n {
  unordered_store(B[P[i]], A[i]);
}

Examples

Example.

In this example, a synchronization variable is used to (a) ensure that all writes to an array of unsynchronized variables are complete, (b) signal that fact to a second task, and (c) pass along the number of values that are valid for reading.

The program

var A: [1..100] real;
var done: sync int;           // initially empty
cobegin {
  { // Reader task
    const numToRead = done;   // block until writes are complete
    for i in 1..numToRead do
      writeln("A[", i, "] = ", A[i]);
  }
  {  // Writer task
    const numToWrite = 14;     // an arbitrary number
    for i in 1..numToWrite do
      A[i] = i/10.0;
    done = numToWrite;        // fence writes to A and signal done
  }
}

produces the output

A[1] = 0.1
A[2] = 0.2
A[3] = 0.3
A[4] = 0.4
A[5] = 0.5
A[6] = 0.6
A[7] = 0.7
A[8] = 0.8
A[9] = 0.9
A[10] = 1.0
A[11] = 1.1
A[12] = 1.2
A[13] = 1.3
A[14] = 1.4

Example (syncSpinWait.chpl).

One consequence of Chapel’s memory consistency model is that a task cannot spin-wait on a normal variable waiting for another task to write to that variable. The behavior of the following code is undefined:

var x: int;
cobegin with (ref x) {
  while x != 1 do ;  // INCORRECT spin wait
  x = 1;
}

In contrast, spinning on a synchronization variable has well-defined behavior:

var x: sync int;
cobegin {
  while x.readXX() != 1 do ;  // spin wait
  x.writeXF(1);
}

In this code, the first statement in the cobegin statement executes a loop until the variable is set to one. The second statement in the cobegin statement sets the variable to one. Neither of these statements blocks.

Example (atomicSpinWait.chpl).

Atomic variables provide an alternative means to spin-wait. For example:

var x: atomic int;
cobegin {
  while x.read() != 1 do ;  // spin wait - monopolizes processor
  x.write(1);
}

Example (atomicWaitFor.chpl).

The main drawback of the above example is that it prevents the thread executing the spin wait from doing other useful work. Atomic variables include a waitFor method that will block the calling thread until a read of the atomic value matches a particular value. In contrast to the spin wait loop above, waitFor will allow other tasks to be scheduled. For example:

var x: atomic int;
cobegin {
  x.waitFor(1);
  x.write(1);
}