.. default-domain:: chpl .. _primers-loops: Loops ===== `View loops.chpl on GitHub `_ This primer teaches the loop forms supported by Chapel, both serial and parallel. .. code-block:: chapel use IO; // enable access to the readln() call that we use below Like most imperative programming languages, Chapel supports loops. A loop is designed to run the statement or statements making up its body a number of times, where that number could be one or even zero. Chapel supports traditional serial loops, which execute the loop's iterations one after another. It also supports parallel loops in which the loop's iterations may run simultaneously using hardware parallelism available in the target system. This primer is designed to introduce these loop forms and to provide guidance as to when each might be appropriate. First, let's start with a quick survey of the loop forms before going through them in detail. Chapel supports: * **while loops:** :ref:`for arbitrary serial loops ` .. code-block:: chapel while cond do ...body... * **for loops:** :ref:`for iterator-driven serial loops ` .. code-block:: chapel for i in 1..n do ...body... * **foreach loops:** :ref:`for parallel loops implemented using hardware parallelism (e.g., vectorization) ` .. code-block:: chapel foreach i in 1..n do ...body... * **forall loops:** :ref:`for parallel loops implemented using software parallelism (e.g., threads) ` .. code-block:: chapel forall i in 1..n do ...body... * **square-bracket loops:** :ref:`as a shorthand for forall- or foreach-loops ` .. code-block:: chapel [i in 1..n] ...body... * **coforall loops:** :ref:`for concurrent loops in which each iteration is executed by a distinct task ` .. code-block:: chapel coforall i in 1..n do ...body... In addition to going through these loop forms in detail, this primer also covers :ref:`loops over arrays and domains `, :ref:`zippered iteration `, :ref:`fully unrolled for-loops `, :ref:`promotion `, :ref:`race conditions `, :ref:`loop nests `, :ref:`when to use which loop form `, and :ref:`a common performance mistake `. Serial Loops ------------ .. _While Loops: While Loops ~~~~~~~~~~~ We'll start with Chapel's *while-loops*, which execute as long as a boolean condition remains true. While loops come in two forms, the ``while...`` loop and the ``do...while`` loop. These are similar to their counterparts in other languages. Here is a ``while...`` loop that will print and double an integer ``i`` until its value exceeds 100: .. code-block:: chapel var i = 1; while i < 100 { writeln("Within the first while-loop, i is ", i); i *= 2; } In the event that the loop body consists of only a single statement, you can use the ``do`` keyword to define it rather than curly brackets (a *compound statement*). For example: .. code-block:: chapel var j = 1; while j < 100 do j *= 3; writeln("At the end of the second while-loop, j is ", j); If you want to be sure to execute the loop body at least once, you'll typically want to use the ``do...while`` form: .. code-block:: chapel var k = 1; do { writeln("Within the third while-loop, k is ", k); k *= 4; } while k < 100; One way in which Chapel's ``do...while`` loops may differ from those you've used in other languages is that the condition expression can refer to symbols declared within the loop. For example, the test against ``i`` in the following loop refers to the local per-iteration constant declared within the loop's body rather than the variable defined above to drive our first while-loop. .. code-block:: chapel do { const i = readln(int); writeln("Within the fourth while-loop, i is ", i); } while i < 100; .. _For Loops: For Loops ~~~~~~~~~ Chapel's other serial loop form is the *for-loop*. Here is a simple for-loop that iterates over the integers 1 through 3, inclusive: .. code-block:: chapel for i in 1..3 { writeln("Within the first for-loop, i is ", i); } Though this example may look and act a lot like the C loop ``for (i=1; i<=3; i++)``, the way it works is somewhat different. Specifically, in Chapel a for-loop always invokes a *serial iterator*. In more detail: Chapel for-loops generally take the form: ``for [inds] in [expr]``, where ``[expr]`` is the *iterand expression* that the loop is traversing. When this iterand is a call to a Chapel iterator, the loop will invoke that iterator. If the iterand is a variable or value of a given type, the loop invokes that type's default serial iterator method. See the :ref:`Iterators Primer ` for more about defining iterators. The iterand in the loop above is the range value ``1..3``, so the loop invokes the ``range`` type's default serial iterator method, which yields the range's indices one at a time. For more about ranges, see the :ref:`Ranges Primer `. As values are yielded back to a for-loop, they are bound to the loop's *index variable(s)*. In this case, the index variable is ``i``. A for-loop's index or indices are brand new identifiers introduced by the loop, and each iteration of the loop can be thought of as getting its own private copy of the index variable. An implication of this is that the ``i`` variable in the loop above is new and distinct from previous ``i`` variables that appeared earlier in this primer. Another implication is that a for-loop's index values will not be carried from one iteration of the loop to the next, nor persist after the loop completes. If you want such behaviors, you'll need to use a while-loop, or to declare additional variables outside of the for-loop that will persist across iterations. The ``range`` type's iterator yields its indices by ``out`` intent, preventing the loop's index variable ``i`` from being assigned within the loop body. In effect, the loop above can be thought of as being equivalent to: .. code-block:: chapel { const i = 1; writeln("Within the first for-loop, i is ", i); } { const i = 2; writeln("Within the first for-loop, i is ", i); } { const i = 3; writeln("Within the first for-loop, i is ", i); } Other iterators may yield their indices using the ``ref`` intent, permits the loop index variable to be modified. We'll see an example of this in the next section. .. _loops-arrs-doms: Loops over Arrays and Domains ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In addition to looping over ranges and explicit iterators, loops in Chapel are commonly used to iterate over arrays or domains (see the :ref:`Arrays ` and :ref:`Domains ` Primers for details on these types). When iterating over an array variable, its serial iterator yields references to the array's elements, permitting them to be read or modified within the loop. For example: .. code-block:: chapel var A = [1.2, 3.4, 5.6, 7.8]; for a in A { writeln("The second for-loop is doubling ", a); a *= 2; } writeln("After the second for-loop, A is: ", A); When iterating over a domain, the corresponding index variable represents a read-only copy of the domain's indices: .. code-block:: chapel for i in A.domain { writeln("In the third for-loop, element ", i, " of A is ", A[i]); } For a multidimensional domain, the index variable will be a tuple, and indices will be yielded in row-major order: .. code-block:: chapel const Dom2D = {1..3, 1..3}; for idx in Dom2D { writeln("The fourth for-loop got index ", idx, " from Dom2D"); } Like other tuples in Chapel, such indices can be de-tupled into their distinct components: .. code-block:: chapel for (row, col) in Dom2D do writeln("The fifth for-loop got row = ", row, " and col = ", col); This last example also demonstrates that single-statement for-loops can be declared using the ``do`` keyword, similar to what we saw with while-loops above. .. _zip-iter: Zippered For-Loops ~~~~~~~~~~~~~~~~~~ For-loops also support *zippered* iteration, in which multiple iterand expressions are invoked in a coordinated manner, yielding multiple indices. For example, the following loop iterates over two ranges and an array in a zippered manner: .. code-block:: chapel for idx in zip(1..4, 1..8 by 2, A) do writeln("Within the first zippered for-loop, idx is ", idx); Note that the iterands in a zippered loop need to have compatible sizes and shapes. In this case, each of the two ranges represent four indices, and the array is 1-dimensional and has four elements, so this is a legal zippering. A zippered loop generates a tuple index, storing one component for each iterand expression being zipped together. As a result, in the loop above, ``idx`` is a 3-tuple, where the first two components are integers representing the indices yielded by ``1..4`` and ``1..8 by 2``, respectively; the third element refers to the elements of ``A``. Like other tuples, such indices can be de-tupled into their distinct components in the loop header: .. code-block:: chapel for (i, j, a) in zip(1..4, 1..8 by 2, A) do writeln("Within the second zippered for-loop, i, j, and a are: ", (i,j,a)); Zippered for-loops can iterate over an arbitrary number of iterand expressions. .. _param-for-loops: Statically Varying (Unrolled) For-Loops ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ One last case to know about is that Chapel has a few for-loop forms that support the ability to have distinct static types or values per iteration. This is achieved by unrolling the for-loop at compile-time to create distinct copies of the loop body that represent the different static properties. The two primary ways to write such for-loops today are by iterating over: * a heterogeneous tuple * a range value whose low and high bounds are both ``param`` expressions and whose index variable is also declared as a ``param``: .. code-block:: chapel const tup = (1, 2.3, "four"); for t in tup do writeln("One component of 'tup' has type ", t.type:string); for param i in 0..`. A parallel iterator can create as many tasks as it wants, and can specify where they should run. By convention, most will create as many tasks as are appropriate for the hardware parallelism that the loop iterand targets. For example, the default parallel iterator for a range, local domain, or local array typically implements its iterations using a number of tasks equal to the number of local processor cores that are available, since those data structures are stored on a single locale. In contrast, the default parallel iterator for a distributed domain or array will typically implement the iterations using all of the available processor cores on all of the locales that own a subset of the domain's indices or array's elements. The task that originally encountered the forall-loop will not proceed past the loop until all tasks created by the parallel iterator to run the loop's iterations have completed. Logically, you can think of there as being a *join* operation on all tasks that are helping to implement the forall-loop. Looking at some simple examples, when run on a k-core processor, each of the following loops will typically use ``k`` tasks to implement the loop's iterations in parallel (at least, when ``k > n``): .. code-block:: chapel config const n = 1000; var B: [1..n] real; forall i in 1..n do B[i] = i: real; writeln("After the forall loop over a range, B is: ", B); forall i in B.domain do B[i] = A[i % A.size]; writeln("After the forall loop over a domain, B is: ", B); forall b in B do b = -b; writeln("After the forall loop over an array, B is: ", B); Note the presence of the ``with``-clause in the first two loops above. By default, variables declared outside of a parallel loop in Chapel, like ``B`` here, will be represented by a ``const`` shadow version of the variable within the loop itself. This is designed to avoid inadvertent race conditions within the loop, by preventing multiple iterations of the loop from writing to the same variable simultaneously. As a result, if it is your intention to modify the variable, using a ``with``-clause is a way to say how that variable should be made available to the loop. In the cases above that want to write to ``B``'s elements using ``B[i]``, the shadow variable for ``B`` would prevent such assignments since it is ``const``. So we override that behavior using the ``with``-clauses to say that ``B`` should be made available to the loop body by reference (``ref``). For more information on task intents and ``with``-clauses, refer to the :ref:`primers-forallLoops` and :ref:`primers-taskParallel` primers. Note, however, that when we're looping over the array itself, as in the last forall-loop above or the foreach-loops in the previous section, there's no need for such an intent because we're not modifying the shadow variable of something declared outside the loop; instead, it's just the loop's indices themselves, which do not receive shadow variables since they are already private to the iteration, and therefore not amenable to races. Next, let's consider some forall-loops over distributed domains and arrays. When we iterate over a domain or array that's distributed across multiple locales, each with ``k`` cores, each locale will tend to use its ``k`` cores to iterate over the subset of the domain or array that it owns locally: .. code-block:: chapel use BlockDist; const BlockDom = blockDist.createDomain({1..n, 1..n}); var C: [BlockDom] real; forall (i,j) in BlockDom do C[i,j] = (100 * here.id) + i + j/1000.0; writeln("After the forall loop over a distributed domain, C is:\n",C); forall c in C do c *= 2; writeln("After the forall loop over a distributed array, C is:\n", C); Because forall-loops invoke parallel iterators, the tasks they create and where they run are not defined by the Chapel language, but by the iterators themselves. Any type supporting parallel iteration should describe how its parallel iterators work as part of its user-facing documentation. For more about distributed domains and arrays or parallel iterators, refer to the :ref:`primers-distributions` and :ref:`primers-parIters` primers. .. _Square-Bracket Loops: Square-Bracket Loops ~~~~~~~~~~~~~~~~~~~~ A third data-parallel loop form uses square brackets to define the loop instead of the ``foreach`` or ``forall`` keywords. For example, such a loop may look like: .. code-block:: chapel [i in 1..n] B[i] -= 0.001; writeln("After the first square bracket loop, B is:\n", B); [c in C] c -= 0.001; writeln("After the second square bracket loop, C is:\n", C); In this loop form, the square brackets can be thought of as replacing the ``for[each|all]`` and ``do`` keywords, respectively. This loop is both a shorthand for a data parallel loop, while also supporting a "sliding scale" of parallelism. Specifically, it will be equivalent to a ``forall`` loop if its iterand has/is a parallel iterator, and a ``foreach`` loop otherwise. .. _loops-promotion: Promotion and Data-Parallel Loops ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In Chapel, an operator or procedure accepting a formal argument of type ``t`` can be *promoted* by invoking the procedure with: * an array whose elements are of type ``t`` * a range or domain whose indices are of type ``t`` Such promotions are equivalent to ``forall`` loops that iterate over each of the promoted actual arguments in a zippered manner, passing the respective elements into the operator or procedure. For example, given the procedure: .. code-block:: chapel proc foo(ref x: real, t: (int, int), d: real) { x = t(0) + t(1)/d; } The call: .. code-block:: chapel foo(C, BlockDom, 100.0); writeln("After the promoted call to foo(), C is:\n", C); is equivalent to the forall-loop: .. code-block:: chapel forall (c, ij) in zip(C, BlockDom) do foo(c, ij, 100.0); writeln("After the equivalent zippered forall loop, C is:\n", C); As a result, the parallel calls to ``foo()`` will be executed using the available processor cores on each of the locales that own a portion of ``C`` since ``C`` is the first iterand in the `zip` expression. .. _race-conditions: A final note on data-parallel loops and legality / races ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ As mentioned previously, the Chapel compiler and language are not responsible for making sure that a data-parallel loop is safe to execute in parallel. Shadow variables reduce the chances of an accidental race, but do not protect against them. If a programmer writes a data-parallel loop that is not parallel-safe or that creates a race, the outcome is their responsibility, not Chapel's. As an example, the following loop may appear to replace the interior elements of an array with the average of their neighbors; yet, because the same elements may be read and written simultaneously by distinct parallel iterations, the results will be unpredictable depending on how the iterations are scheduled at execution-time: .. code-block:: chapel var D = [i in 1..n] (i**2): real; writeln("Before the race-y averaging loop, D is: ", D); forall i in 2..`. When using coforall loops, keep in mind that a task will be created for each iteration, and that each task will consume memory and processing resources. For example, a coforall loop with a million iterations will literally create a million tasks. If you don't have a million cores to run them on, this is likely to be overkill, requiring more memory and processing power than is warranted.. If there is no explicit synchronization between the iterations, a forall loop is typically a better choice, since it would use a number of tasks proportional to the targeted hardware parallelism. Closing Discussions ------------------- At this point, you've learned about all of Chapel's loop forms. The remaining sections cover some loop-related topics that may come up in practice. .. _loop-nests: Nesting Loops ~~~~~~~~~~~~~ The loop forms discussed here can be nested arbitrarily, and their definitions are the same whether they are an outer or inner loop. A nested for-loop will perform nested serial iterations as in other languages. A nested coforall-loop will create a number of tasks equal to the outer loop's trip count and the sum of all the inner loops' counts. For example, the following loop will create around ``x**2`` tasks, since each iteration of each loop will create its own task. .. code-block:: chapel config const x = 4; writeln(); coforall i in 1..x do coforall j in 1..x do writeln("Here's a message from one of the nested coforall tasks"); A tricky case to reason about in a nested loop situation is the forall-loop since its implementation is essentially "Do whatever the parallel iterator says." If a parallel iterator were to always create ``x`` tasks (say), then a nested forall loop invoking that iterator in both loops would create roughly ``x**2`` tasks as in the coforall example above. In practice, however, most parallel iterators (including those defined on ranges, domains, and arrays) will take stock of the number of tasks running on the current locale and then throttle the number of tasks they create to avoid overwhelming the node. As a result, a nested forall-loop over a pair of ranges, like: .. code-block:: chapel forall i in 1..500 { forall j in 1..500 { // do some computation here } } will typically create only ``numCores`` tasks in total. Specifically, the outer loop will create ``numCores`` tasks, then each of the inner loops will see that all the cores are busy and avoid creating additional tasks since there is nowhere to run them. In such cases, if the inner loop body was not computationally intensive, it could make sense to rewrite the inner loop as a ``foreach`` in order to avoid the overhead of having the iterator determine whether or not to create tasks at all: .. code-block:: chapel forall i in 1..500 { foreach j in 1..500 { // do some computation here } } That said, such overheads are relatively modest, so for loop bodies that are computationally intensive, the benefit of changing the inner loop from ``forall`` to ``foreach`` may be negligible. In summary, there is nothing magical about nested loops. When reasoning about what a given loop nest does, consider the loops one at a time. For example, what does the outer loop do? ("It's a ``forall``, so it will invoke the parallel iterator specified by its iterand.") OK, what about the next loop? ("It's a `coforall`, so it will literally create a task per iteration regardless of how many are already running). What about the next loop? ("It's a ``foreach``, so it will try to use hardware features in the task's current target processor to implement its iterations"). Chapel's implementation of parallel loops is very imperative, where the most complex case is being familiar with the parallelism implemented by any iterand expressions of a ``forall`` loop. .. _loop-choice: When to Use Which Loop Form? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Given these various loop forms, which ones should you use when? Starting with the obvious, if you have a loop that wants or needs to be serial, such as a loop spanning the time steps of a simulation, you should use one of the serial loop forms. When writing a serial loop, if you are iterating over a type that supports iteration, like an array, domain, range, or list, the for-loop can often be the clearest and most elegant loop form. Or, if a serial iterator exists that does what you want, invoking it with a for-loop is also an obvious choice. But if you want to do something more general that is not currently supported by an iterator, the while-loop can serve as a more general fallback. Or you might want to write an iterator of your own that wraps your unique serial loop structure, and then use a for-loop to invoke it. Refer to the :ref:`Iterators Primer ` for more information about doing this. When choosing between the parallel loop forms, one consideration should be how many iterations the loop has. For example, if you're iterating over an array with a million elements, you typically wouldn't want to use a coforall-loop, since that would literally create a million tasks. Unless you happen to have a million processor cores, this is probably overkill. And even if you do have a million cores, each would only own one element; so if your loop's body was not computationally intensive, you may spend more time creating and destroying tasks than actually getting useful work done. In such cases, the forall- or foreach-loops could be a better choice since they will create parallelism proportional to the computational resources that are storing the array. Another consideration is whether you require synchronization or coordination between distinct iterations of the loop. If you do, the coforall-loop is probably the right choice since it's the only one that permits inter-iteration coordination, since each iteration will be executed by a distinct task. When you do not require synchronization between iterations, the forall-loop or square-bracket loop are generally good defaults to reach for since they will make best use of the hardware parallelism corresponding to the iterand expression. The foreach loop can serve as an alternative if you know that you're already running a number of tasks that will saturate your hardware parallelism, or if the loop itself is of sufficiently modest size or computational intensity that creating new tasks to execute it would be overkill. .. _common-loop-mistake: A Common Performance Mistake ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Wrapping up, one of Chapel's most powerful features — the fact that forall loops can generate distributed memory parallelism in addition to local, multi-core parallelism — can also be the cause of simple errors that can kill performance. When writing forall-loops, it is important to consider the iterand expression and what computational resources it will use. Here is an example that illustrates how things can go wrong: .. code-block:: chapel // create a block-distributed array G var G = blockDist.createArray(1..10, int); // attempt (but fail) to iterate over G's elements in a parallel, // distributed manner forall i in 1..10 do G[i] = i; writeln("After the non-distributed forall, G is: ", G); While the code, as written, will work properly, the comment is incorrect in expecting that the computation will be distributed. Specifically, even though ``G`` is distributed and accessed within the loop, the forall loop's iterand is a range and ranges are not distributed. As a result, the range's default iterator method will only consider the local cores when deciding how many tasks to create and where to run them. The loop's body will still be able to update the remote elements of ``G`` by virtue of Chapel's global namespace. We can see that this is the case by storing the ID of the locale that executes each iteration into G: .. code-block:: chapel forall i in 1..10 do G[i] = here.id; writeln("The locales assigning to G (range version) were: ", G); In order to get the correct behavior, we'd need to iterate over something distributed instead, like G itself: .. code-block:: chapel G = -1; // reset G forall g in G do g = here.id; writeln("The locales assigning to G (array version) were: ", G); Or its domain, which is also distributed: .. code-block:: chapel G = -1; // reset G forall i in G.domain do G[i] = here.id; writeln("The locales assigning to G (domain version) were: ", G); In addition to iterating over arrays and domains, iterating over slices of arrays and domains is another technique for making sure your forall-loop computations maintain locality and affinity. For example: .. code-block:: chapel G = -1; // reset G forall g in G[2..G.size-1] do g = here.id; writeln("The locales assigning to a slice of G were: ", G[2..G.size-1]); As a final note, the following pattern can be a particularly surprising instance of the above: .. code-block:: chapel forall loc in Locales { on loc { // do some computation } } Although the `Locales` array represents the set of distributed locales on which the program is running, it is implemented using a local array on each locale. As a result, the parallelism generated by this loop structure will once again be based on the number of local cores, implying that if ``numLocales >> here.maxTaskPar``, you will not end up executing on all the locales simultaneously. A better approach would be to use a coforall-loop: .. code-block:: chapel coforall loc in Locales { on loc { // do some computation } } This will create a task per locale regardless of the number of local cores, ensuring that all locales end up computing simultaneously. The lesson here is to make sure you're iterating over a distributed expression when you want your forall-loop to parallelize across a number of locales greater than the number of local cores. Conclusion ---------- That wraps up this primer introducing Chapel's various loop types. For further details, refer to the Chapel language specification or ask questions in our user support channels.