Data Parallelism

Chapel provides two explicit data-parallel constructs (the forall-statement and the forall-expression) and several idioms that support data parallelism implicitly (whole-array assignment, function and operator promotion, reductions, and scans).

This chapter details data parallelism as follows:

Data-parallel constructs may result in accesses to the same variable from different tasks, possibly due to aliasing using ref argument intents or forall intents, among others. Such accesses are subject to the Memory Consistency Model (Memory Consistency Model).

The Forall Statement

The forall statement is a concurrent variant of the for statement described in The For Loop.

Syntax

The syntax of the forall statement is given by

forall-statement:
  'forall' index-var-declaration 'in' iteratable-expression task-intent-clause[OPT] 'do' statement
  'forall' index-var-declaration 'in' iteratable-expression task-intent-clause[OPT] block-statement
  'forall' iteratable-expression task-intent-clause[OPT] 'do' statement
  'forall' iteratable-expression task-intent-clause[OPT] block-statement
  [ index-var-declaration 'in' iteratable-expression task-intent-clause[OPT] ] statement
  [ iteratable-expression task-intent-clause[OPT] ] statement

As with the for statement, the indices may be omitted if they are unnecessary and the do keyword may be omitted before a block statement.

The square bracketed form will resort to order-independent iteration (i.e. foreach) when iteratable-expression does not support parallel iteration. The forall form will result in an error when parallel iteration is not available.

The handling of the outer variables within the forall statement and the role of task-intent-clause are defined in Forall Intents.

Execution and Serializability

The forall statement evaluates the loop body once for each element yielded by the iteratable-expression. Each instance of the forall loop’s body may be executed concurrently with the others, but this is not guaranteed. In particular, the loop must be serializable. Details regarding concurrency and iterator implementation are described in Parallel Iterators.

This differs from the semantics of the coforall loop, discussed in The Coforall Loop, where each iteration is guaranteed to run using a distinct task. The coforall loop thus has potentially higher overhead than a forall loop with the same number of iterations, but in cases where concurrency is required for correctness, it is essential.

In practice, the number of tasks that will be used to evaluate a forall loop is determined by the object or iterator that is leading the execution of the loop, as is the mapping of iterations to tasks.

This concept will be formalized in future drafts of the Chapel specification. For now, the primer on parallel iterators provides a brief introduction. Please also refer to User-Defined Parallel Zippered Iterators in Chapel, published in the PGAS 2011 workshop.

Control continues with the statement following the forall loop only after every iteration has been completely evaluated. At this point, all data accesses within the body of the forall loop will be guaranteed to be completed.

A return statement may not be lexically enclosed in a forall statement. A yield statement may only be lexically enclosed in a forall statement that is within a parallel iterator Parallel Iterators. A break statement may not be used to exit a forall statement. A continue statement skips the rest of the current iteration of the forall loop.

Example (forallStmt.chpl).

In the code

forall i in 1..N with (ref a) do
  a(i) = b(i);

the user has stated that the element-wise assignments can execute concurrently. This loop may be executed serially with a single task, or by using a distinct task for every iteration, or by using a number of tasks where each task executes a number of iterations. This loop can also be written as

[i in 1..N with (ref a)] a(i) = b(i);

Zippered Iteration

Zippered iteration has the same semantics as described in Zippered Iteration and Parallel Iterators for parallel iteration.

The Forall Expression

The forall expression is a concurrent variant of the for expression described in For Expressions.

Syntax

The syntax of a forall expression is given by

forall-expression:
  'forall' index-var-declaration 'in' iteratable-expression task-intent-clause[OPT] 'do' expression
  'forall' iteratable-expression task-intent-clause[OPT] 'do' expression
  [ index-var-declaration 'in' iteratable-expression task-intent-clause[OPT] ] expression
  [ iteratable-expression task-intent-clause[OPT] ] expression

As with the for expression, the indices may be omitted if they are unnecessary. The do keyword is always required in the keyword-based notation.

As with the forall statement, the square bracketed form will resort to order-independent iteration (i.e. foreach) when iteratable-expression does not support parallel iteration. The forall form will result in an error when parallel iteration is not available.

The handling of the outer variables within the forall expression and the role of task-intent-clause are defined in Forall Intents.

Execution

A forall expression is an iterator that executes a forall loop (The Forall Statement), evaluates the body expression on each iteration of the loop, and yields each resulting value.

When a forall expression is used to initialize a variable, such as

var X = forall iterableExpression() do computeValue();

the variable will be inferred to have an array type. The array’s domain is defined by the iterable-expression following the same rules as for promotion, both in the regular case Promotion and in the zippered case Zippered Promotion.

Example (forallExpr.chpl).

The code

writeln(+ reduce [i in 1..10] i**2);

applies a reduction to a forall-expression that evaluates the square of the indices in the range 1..10.

The forall expression follows the semantics of the forall statement as described in Execution and Serializability.

Zippered Iteration

Forall expression also support zippered iteration semantics as described in Zippered Iteration and Parallel Iterators for parallel iteration.

Filtering Predicates in Forall Expressions

A filtering predicate is an if expression that is immediately enclosed by a forall expression and does not have an else clause. Such an if expression filters the iterations of the forall expression. The iterations for which the condition does not hold are not reflected in the result of the forall expression.

When a forall expression with a filtering predicate is captured into a variable, the resulting array has a 0-based one-dimensional domain.

Example (forallFilter.chpl).

The following expression returns every other element starting with the first:

[i in 1..s.size] if i % 2 == 1 then s(i)

Forall Intents

If a variable is referenced within the lexical scope of a forall statement or expression and is declared outside that forall construct, it is subject to forall intents, analogously to task intents for task-parallel constructs (see Task Intents). That is, the outer variable is considered to be passed as an actual argument to an implicit formal of the iterator leading the execution of the loop. From there, it is passed down to each task created by that iterator, if any, as an actual argument to an implicit formal of the corresponding task function. A top-level task passes it down recursively to its child tasks, if any. All references to the variable within the forall construct implicitly refer to a shadow variable, i.e. the corresponding formal argument of the task function or the leading iterator.

When the forall construct is inside a method on a record, accesses a field of this, and does not contain an explicit forall intent on this (see below), the field itself is treated as an outer variable. That is, it is subject to forall intents and all references to this field within the forall construct implicitly refer to the corresponding shadow variable.

Each formal argument of a task function or iterator has the default intent by default. See also The Default Intent. Note that the default intent allows the compiler to assume that the value will not be concurrently modified, except for values of sync or atomic type.

For variables of primitive, enum, and class types, this has the effect of capturing the value of the variable at task creation time. Within the lexical scope of the forall construct, the variable name references the captured value instead of the original value.

A formal can be given another intent explicitly by listing it with that intent in the optional task-intent-clause. For example, for variables of most types, the ref intent allows the body of the forall loop to modify the corresponding original variable or to read its updated value after concurrent modifications. The in intent is an alternative way to obtain task-private variables (see Task-Private Variables).

A reduce forall intent can be used to reduce values across iterations of a forall loop. While it is similar to the reduce task intent (see Task Intents), there is a difference in how values are combined at the end of a task. With a reduce forall intent, each child task combines its accumulated value into its parent task rather than into an outer variable. The reduce= operator accumulates its right-hand side values computed for all iterations executed by a given task into the same shadow variable for that task.

Rationale.

A forall statement or expression may create tasks in its implementation. Forall intents affect those tasks in the same way that task intents Task Intents affect the behavior of a task construct such as a coforall loop.

Task-Private Variables

A task-private variable declared in a forall loop results in a separate shadow variable in each task created by the forall loop’s parallel iterator, as well as a “top-level” shadow variable created at the top level of the parallel iterator itself. In contrast to regular forall intents Forall Intents, these shadow variables are unrelated to outer variables of the same name, if any.

A given shadow variable is created at the start and destroyed at the end of its task. Within the lexical scope of the body of the forall statement or expression, the variable name refers to the shadow variable created in the task that executed the current yield statement.

The “top-level” shadow variable is created at the start and destroyed at the end of the parallel iterator. It is referenced in those iterations of the forall loop that are due to “top-level” yields, i.e. yields that are outside any of the task constructs that the iterator may have.

The syntax of a task-private variable declaration in a forall statement’s with-clause is:

task-private-var-decl:
  task-private-var-kind identifier type-part[OPT] initialization-part[OPT]

task-private-var-kind:
  'const'
  'var'
  'ref'

The declaration of a const or var task-private variable must have at least one of type-part and initialization-part. A ref task-private variable must have initialization-part and cannot have type-part. A ref shadow variable is a reference to the initialization-part as calculated at the start of the corresponding task or the iterator. ref shadow variables are never destroyed.

Example (task-private-variable.chpl).

In the following example, the writeln() statement will observe the first shadow variable 4 times: twice each for the yields “before coforall” and “after coforall”. An additional shadow variable will be created and observed twice for each of the three coforall tasks.

var cnt: atomic int;                     // count our shadow variables
record R { var id = cnt.fetchAdd(1); }

iter myIter() { yield ""; }              // serial iterator, unused

iter myIter(param tag) where tag == iterKind.standalone {
  for 1..2 do
    yield "before coforall";             // shadow var 0 ("top-level")
  coforall 1..3 do
    for 1..2 do
      yield "inside coforall";           // shadow vars 1..3
  for 1..2 do
    yield "after coforall";              // shadow var 0, again
}

forall str in myIter()
  with (var tpv: R)                      // declare a task-private variable
do
  writeln("shadow var: ", tpv.id, "  yield: ", str);

Promotion

A function that expects one or more scalar arguments but is called with one or more arrays, domains, ranges, or iterators is promoted if the element types of the arrays, the index types of the domains and/or ranges, or the yielded types of the iterators can be resolved to the type of the argument. The rules of when an overloaded function can be promoted are discussed in Function Resolution.

Functions that can be promoted include procedures, operators, casts, and methods. Also note that since class and record field access is performed with getter methods (Field Getter Methods), field access can also be promoted.

If the original function returns a value or a reference, the corresponding promoted expression is an iterator yielding each computed value or reference.

When a promoted expression is used to initialize a variable, such as var X = A.x; in the above example, the variable’s type will be inferred to be an array. The array’s domain is defined by the expression that causes promotion:

input expression

resulting array’s domain

array

that array’s domain

domain

that domain

range

one-dimensional domain built from that range

iterator

0-based one-dimensional domain

Note

Future

We would like to allow the iterator author to specify the shape of the iterator, i.e. the domain of the array that would capture the result of the corresponding promoted expression, such as

var myArray = myScalarFunction(myIterator());

This will be helpful, for example, when the iterator yields one value per an array or domain element that it iterates over internally.

Example (promotion.chpl).

Given the array

var A: [1..5] int = [i in 1..5] i;

and the function

proc square(x: int) do return x**2;

then the call square(A) results in the promotion of the square function over the values in the array A. The result is an iterator that returns the values 1, 4, 9, 16, and 25.

Example (field-promotion.chpl).

Given an array of points, such as A defined below:

record Point {
  var x: real;
  var y: real;
}
var A: [1..5] Point = [i in 1..5] new Point(x=i, y=i);

the following statement will create a new array consisting of the x field value for each value in A:

var X = A.x;

and the following call will set the y field values for each element in A to 1.0:

A.y = 1.0;

Default Arguments

When a call is promoted and that call relied upon default arguments (Default Values), the default argument expression can be evaluated many times. For example:

Example (promotes-default.chpl).

var counter: atomic int;

proc nextCounterValue():int {
  var i = counter.fetchAdd(1);
  return i;
}

proc assignCounter(ref x:int, counter=nextCounterValue()) {
  x = counter;
}

Here the function assignCounter has a default argument providing the next value from an atomic counter as the value to set.

var A: [1..5] int;
assignCounter(A);

The assignCounter call uses both the default argument for counter as well as promotion. When these features are combined, the default argument will be evaluated once per promoted element. As a result, after this command, A will contain the elements 0 1 2 3 4 in some order.

Zippered Promotion

Promotion also supports zippered iteration semantics as described in Zippered Iteration and Parallel Iterators for parallel iteration.

Consider a function f with formal arguments s1, s2, … that are promoted and formal arguments a1, a2, … that are not promoted. The call

f(s1, s2, ..., a1, a2, ...)

is equivalent to

[(e1, e2, ...) in zip(s1, s2, ...)] f(e1, e2, ..., a1, a2, ...)

The usual constraints of zippered iteration apply to zippered promotion, so the promoted actuals must have the same shape.

Formal arguments that are not promoted are evaluated once and stored in a temporary variable. If formal a1 is an expression, then the call

f(s1, s2, ..., a1, a2, ...)

is equivalent to

var tmp = a1;
[(e1, e2, ...) in zip(s1, s2, ...)] f(e1, e2, ..., tmp, a2, ...)

In this instance, if formal a1 is an expression that has side effects (such as printing), those side effects will only occur once.

A zippered promotion can be captured in a variable, such as var X = f(s1, s2, ..., a1, a2, ...); using the above example. If so, the domain of the resulting array is defined by the first argument that causes promotion. The rules are the same as in the non-zippered case.

Example (zipper-promotion.chpl).

Given a function defined as

proc foo(i: int, j: int) {
  return (i,j);
}

and a call to this function written

writeln(foo(1..3, 4..6));

then the output is

(1, 4) (2, 5) (3, 6)

Whole Array Operations and Evaluation Order

Whole array operations are a form of promotion as applied to operators rather than functions.

Whole array assignment is one example. It is is implicitly parallel. The array assignment statement:

LHS = RHS;

is equivalent to

forall (e1,e2) in zip(LHS,RHS) do
  e1 = e2;

The semantics of whole array assignment and promotion are different from most array programming languages. Specifically, the compiler does not insert array temporaries for such operations if any of the right-hand side array expressions alias the left-hand side expression.

Example.

If A is an array declared over the indices 1..5, then the following codes are not equivalent:

A[2..4] = A[1..3] + A[3..5];

and

var T = A[1..3] + A[3..5];
A[2..4] = T;

This follows because, in the former code, some of the new values that are assigned to A may be read to compute the sum depending on the number of tasks used to implement the data parallel statement.

Reductions and Scans

Chapel provides reduction and scan expressions that apply operators to aggregate expressions in stylized ways. Reduction expressions collapse the aggregate’s values down to a summary value. Scan expressions compute an aggregate of results where each result value stores the result of a reduction applied to all of the elements in the aggregate up to that expression. Chapel provides a number of predefined reduction and scan operators, and also supports a mechanism for the user to define additional reductions and scans (User-Defined Reductions and Scans).

Reduction Expressions

A reduction expression applies a reduction operator to an aggregate expression, collapsing the aggregate’s dimensions down into a result value (typically a scalar or summary expression that is independent of the input aggregate’s size). For example, a sum reduction computes the sum of all the elements in the input aggregate expression.

The syntax for a reduction expression is given by:

reduce-expression:
  reduce-scan-operator 'reduce' iteratable-expression
  class-type 'reduce' iteratable-expression

reduce-scan-operator: one of
  + * && || & | ^ 'min' 'max' 'minmax' 'minloc' 'maxloc'

Chapel’s predefined reduction operators are defined by reduce-scan-operator above. In order, they are: sum, product, logical-and, logical-or, bitwise-and, bitwise-or, bitwise-exclusive-or, minimum, maximum, minimum-and-maximum, minimum-with-location, and maximum-with-location. The minimum reduction returns the minimum value as defined by the < operator. The maximum reduction returns the maximum value as defined by the > operator. The minimum-and-maximum reduction returns a tuple with the first component being the result of the minimum reduction and the second component being the result of the maximum reduction. The minimum-with-location reduction returns the lowest index position with the minimum value (as defined by the < operator). The maximum-with-location reduction returns the lowest index position with the maximum value (as defined by the > operator). When a minimum, maximum, minimum-and-maximum, minimum-with-location, or maximum-with-location reduction encounters a NaN, the result is or contains a NaN.

The expression on the right-hand side of the reduce keyword can be of any type that can be iterated over, provided the reduction operator can be applied to the values yielded by the iteration. For example, the bitwise-and operator can be applied to arrays of boolean or integral types to compute the bitwise-and of all the values in the array.

For the minimum-with-location and maximum-with-location reductions, the argument on the right-hand side of the reduce keyword must be a 2-tuple. Its first component is the collection of values for which the minimum/maximum value is to be computed. The second argument component is a collection of indices with the same size and shape that provides names for the locations of the values in the first component. The reduction returns a tuple containing the minimum/maximum value in the first argument component and the value at the corresponding location in the second argument component.

Example (reduce-loc.chpl).

The first line below computes the smallest element in an array A as well as its index, storing the results in minA and minALoc, respectively. It then computes the largest element in a forall expression making calls to a function foo(), storing the value and its number in maxVal and maxValNum.

var (minA, minALoc) = minloc reduce zip(A, A.domain);
var (maxVal, maxValNum) = maxloc reduce zip([i in 1..n] foo(i), 1..n);

User-defined reductions are specified by preceding the keyword reduce by the class type that implements the reduction interface as described in User-Defined Reductions and Scans.

Scan Expressions

A scan expression applies a scan operator to an aggregate expression, resulting in an aggregate expression of the same size and shape. The output values represent the result of the operator applied to all elements up to and including the corresponding element in the input.

The syntax for a scan expression is given by:

scan-expression:
  reduce-scan-operator 'scan' iteratable-expression
  class-type 'scan' iteratable-expression

The predefined scans are defined by reduce-scan-operator. These are identical to the predefined reductions and are described in Reduction Expressions.

The expression on the right-hand side of the scan can be of any type that can be iterated over and to which the operator can be applied.

Example.

Given an array

var A: [1..3] int = 1;

that is initialized such that each element contains one, then the code

writeln(+ scan A);

outputs the results of scanning the array with the sum operator. The output is

1 2 3

User-defined scans are specified by preceding the keyword scan by the class type that implements the scan interface as described in User-Defined Reductions and Scans.

Configuration Constants for Default Data Parallelism

The following configuration constants are provided to control the degree of data parallelism over ranges, default domains, and default arrays:

Config Const

Type

Default

dataParTasksPerLocale

int

top level .maxTaskPar   (see Locale Methods)

dataParIgnoreRunningTasks

bool

true

dataParMinGranularity

int

1

The configuration constant dataParTasksPerLocale specifies the number of tasks to use when executing a forall loop over a range, default domain, or default array. The actual number of tasks may be fewer depending on the other two configuration constants. A value of zero results in using the default value.

The configuration constant dataParIgnoreRunningTasks, when true, has no effect on the number of tasks to use to execute the forall loop. When false, the number of tasks per locale is decreased by the number of tasks that are already running on the locale, with a minimum value of one.

The configuration constant dataParMinGranularity specifies the minimum number of iterations per task created. The number of tasks is decreased so that the number of iterations per task is never less than the specified value.

For distributed domains and arrays that have these same configuration constants (e.g., Block and Cyclic distributions), these same module level configuration constants are used to specify their default behavior within each locale.