Chapel Tasks
Overview
Chapel programs create new tasks via the begin, cobegin, and coforall statements. Tasks are computations that can conceptually execute concurrently, though they may or may not do so in practice.
An implementation of Chapel must include at least one tasking layer. A tasking layer in turn uses threads, which are a mechanism for executing work in parallel, to run tasks.
All tasking layers support configuration constants to control system resources such as the number of threads that are available to execute tasks and the amount of call stack space reserved for each task. Generally speaking, the Chapel programmer can make no assumptions about the scheduling of threads or the mapping of tasks to threads other than those semantics defined by the language specification.
This document describes the currently supported tasking options in more detail. The rest of this document includes:
- an overview of the different tasking options
- a detailed description of each tasking option
- a discussion of the number of threads used by each tasking option
- a discussion of call stack sizes and overflow handling
- a list of tasking-related methods on the locale type
- a brief description of future directions for the tasking layer
Task Implementation Layers
This release contains two distinct tasking layers for Chapel tasks.
The user can select between these options by setting the CHPL_TASKS
environment variable to one of the following values:
- qthreads:
best performance; default for most targets
- fifo:
most portable, but heavyweight; default for NetBSD and Cygwin
Each tasking layer is described in more detail below:
CHPL_TASKS == qthreads
Chapel’s default tasking layer implementation for most targets is based on the Qthreads user-level threading package from Sandia National Labs. This provides a lightweight implementation of Chapel tasking as well as an optimized implementation of sync variables. To use qthreads tasking, please take the following steps:
- Ensure that the environment variable CHPL_HOME points to the top-level Chapel directory.
- Set up your environment to use Qthreads: ensure CHPL_TASKS is not set (if qthreads is the default), or run:
  export CHPL_TASKS=qthreads
- Follow the instructions in Setting up Your Environment for Chapel to set up, compile and run your Chapel programs.
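The setup steps above can be sketched as a short shell session. This is a minimal sketch rather than an official procedure: the fallback path used when CHPL_HOME is unset is a placeholder, and the printchplenv check only runs if that helper script is present in your Chapel tree.

```shell
# Step 1: CHPL_HOME should point at the top-level Chapel directory.
# The fallback below is a placeholder path for illustration only.
: "${CHPL_HOME:=/path/to/chapel}"
export CHPL_HOME

# Step 2: select the qthreads tasking layer explicitly.
export CHPL_TASKS=qthreads

# Optional check: print the Chapel configuration if the helper exists.
if [ -x "$CHPL_HOME/util/printchplenv" ]; then
  "$CHPL_HOME/util/printchplenv"
fi

echo "CHPL_TASKS=$CHPL_TASKS"
```

Setting CHPL_TASKS explicitly is harmless on targets where qthreads is already the default, and it makes the choice visible in build scripts.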
Please report any apparent bugs in Qthreads tasking to the Chapel team.
Stack overflow detection
The qthreads tasking implementation can arrange to halt programs when any task overflows its call stack (see Task Call Stacks). It does this by placing a guard page, which cannot be referenced, at the end of each task stack. When a task tries to extend its stack onto a guard page, it fails with a segfault.
Normally guard pages for stack overflow detection are configured and enabled. There is a performance cost for this, however. We do not have a quantitative estimate for this cost, but it is a fixed overhead (a couple of system calls) added to the time needed to run every task, so qualitatively speaking it will have a greater effect on programs which create more or shorter-lived tasks than on programs which create fewer or longer-lived ones.
As described in Task Call Stacks, the execution-time default for
stack overflow checking can be set by using the --[no-]stack-checks
compiler option. But whatever the default is, at execution time stack
overflow detection can be turned off by setting the environment variable
QT_GUARD_PAGES
to any of the values “0”, “no”, or “false”, or on by
setting it to any of “1”, “yes”, or “true”. When it is off the execution
overhead is negligible (just a couple of scalar tests). Developers
who wish to remove even this small cost can disable guard pages by
building qthreads with guard pages entirely configured out, as follows:
cd $CHPL_HOME/third-party/qthread
make CHPL_QTHREAD_NO_GUARD_PAGES=yes ... clean all
As noted, running without guard pages can improve performance and thus may be desirable for production work. Before doing so, however, run tests at similar scale with guard pages turned on, if possible, to check for stack overflow, because undetected stack overflows can cause subtle and intermittent errors in execution.
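The workflow described above, a checked test run followed by an unchecked production run, might look like the following sketch. The program name ./myprog is hypothetical, and the helper only prints the command it would run if no such binary exists.

```shell
# Hypothetical name of a Chapel program built with CHPL_TASKS=qthreads.
PROG=./myprog

# Run the program with guard pages forced on or off for that run only.
# $1 is one of the documented values: 0/no/false or 1/yes/true.
run_with_guard_pages() {
  setting=$1
  shift
  if [ -x "$PROG" ]; then
    QT_GUARD_PAGES="$setting" "$PROG" "$@"
  else
    echo "would run: QT_GUARD_PAGES=$setting $PROG $*"
  fi
}

# Test run: detect stack overflows at a similar scale first.
run_with_guard_pages true

# Production run: skip the per-task guard page overhead.
run_with_guard_pages false
```

Setting the variable on the command line of a single run, rather than exporting it, keeps the test and production settings from leaking into each other.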
Environment variables
Qthreads provides a number of environment variables that can be used to configure its behavior at execution time. An introduction to these can be found in the ENVIRONMENT section of the qthread_init man page. (Note that although this man page documents variables named QTHREAD_*, each variable is actually present in both QT_* and QTHREAD_* forms, with the former superseding the latter.) The qthreads man pages are available by means of the man -M option, for example:
man -M $CHPL_HOME/third-party/qthread/qthread-src/man qthread_init
Note that in some cases there are Chapel environment variables that override Qthreads counterparts. CHPL_RT_NUM_THREADS_PER_LOCALE overrides QT_HWPAR, for example. Whenever a Chapel variable overrides a Qthreads variable, you should use the Chapel one.
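As a small illustration of that advice, the sketch below warns when the overridden Qthreads variable is set directly. The function name check_overridden is hypothetical, and it checks only the one pair named in this document (in both its QT_* and QTHREAD_* forms).

```shell
# Warn when a Qthreads variable that Chapel overrides is set directly.
# Only the pair named in this document is checked here.
check_overridden() {
  for var in QT_HWPAR QTHREAD_HWPAR; do
    eval "val=\${$var:-}"
    if [ -n "$val" ]; then
      echo "warning: $var=$val is set; use CHPL_RT_NUM_THREADS_PER_LOCALE instead"
    fi
  done
}

check_overridden
```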
Worker affinity and number
Simplistically, there are two kinds of threads in Qthreads: shepherds
that manage work distribution, and workers that host qthreads (Chapel
tasks, for our purposes). The execution-time environment variable
QT_WORKER_UNIT
controls how worker threads are distributed on hardware
processors. The default is “core” to distribute workers across CPU
cores (physical processors). An alternative is “pu”, which distributes
workers across processing units. These are instances of the processor
architecture, or hardware threads if the cores have those. Note that
“pu” will be automatically selected if CHPL_RT_NUM_THREADS_PER_LOCALE
is set to anything larger than the number of cores, so it usually isn’t
necessary to set QT_WORKER_UNIT.
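For the uncommon case where you do want to set it explicitly, a minimal sketch follows. It uses only the two values named above and makes no assumption about the thread count.

```shell
# Distribute workers across processing units (hardware threads)
# instead of the default, which is one worker per core.
export QT_WORKER_UNIT=pu

echo "QT_WORKER_UNIT=$QT_WORKER_UNIT"
```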
Overloading system nodes
By default the qthreads tasking implementation is set up to assume that
its process is not competing with anything else for system resources
(CPUs and memory) on its system node. In this mode, qthreads optimizes
its internal behavior to favor performance over load balancing. This
works out well for Chapel programs, because normally Chapel runs with
one process (locale) per system node. However, with CHPL_COMM=gasnet
or CHPL_COMM=ofi
one can run multiple Chapel locales on a single
system node, say for doing multilocale functional correctness testing
with limited system resources. (See Multilocale Chapel Execution for more
details.) When this is done qthreads’ optimization for performance can
greatly reduce performance, due to resource starvation among multiple
Chapel processes. If you need qthreads to share system resources more cooperatively with other processes, set CHPL_RT_OVERSUBSCRIBED=yes at execution time (see Oversubscription).
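An oversubscribed test run might be launched as in the sketch below. The program name and locale count are hypothetical; -nl is the usual Chapel option for choosing the number of locales. If the binary is absent, the sketch only prints the command it would run.

```shell
# Hypothetical multilocale program built with CHPL_COMM=gasnet (or ofi).
PROG=./myprog_multilocale
NUM_LOCALES=4   # several locales sharing one system node

# Ask the runtime to share CPUs and memory more cooperatively.
export CHPL_RT_OVERSUBSCRIBED=yes

if [ -x "$PROG" ]; then
  "$PROG" -nl "$NUM_LOCALES"
else
  echo "would run: $PROG -nl $NUM_LOCALES (CHPL_RT_OVERSUBSCRIBED=$CHPL_RT_OVERSUBSCRIBED)"
fi
```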
Hwloc
When CHPL_TASKS=qthreads, the default for CHPL_HWLOC becomes “bundled”, and the hwloc third-party package will be built. Qthreads depends on this package to provide it with a description of the locale hardware, to support locality and affinity operations.
Further information
For more information on Qthreads, see $CHPL_HOME/third-party/README.
CHPL_TASKS == fifo
FIFO tasking over POSIX threads (or pthreads) works on all platforms and is the default for Cygwin and NetBSD. It is attractive in its portability, though on most platforms it will tend to be heavier weight than Chapel strictly requires. FIFO tasking is also used when Chapel is configured in ‘Quick Start’ mode (see Chapel Quickstart Instructions). To use FIFO tasking, please take the following steps:
- Ensure that the environment variable CHPL_HOME points to the top-level Chapel directory.
- Set up your environment to use FIFO tasking:
  export CHPL_TASKS=fifo
- Follow the instructions in Setting up Your Environment for Chapel to set up, compile and run your Chapel programs.
In the FIFO tasking implementation, Chapel tasks are mapped to threads such that each task is executed by a single thread and is run to completion before giving up that thread. As a result, a program can have no more tasks active (that is, created and started) at any given time than it has threads on which to run those tasks. It can create more tasks than threads, but no more tasks will be run at any time than there are threads. Excess tasks are placed in a pool where they will be picked up and started by threads as they complete their tasks.
The threading implementation uses POSIX threads (pthreads) to run Chapel tasks. Because pthreads are relatively expensive to create, it does not destroy them when there are no tasks for them to execute. Instead they stay around and continue to check the task pool for tasks to execute. Setting the number of pthreads is described in Controlling the Number of Threads.
Stack overflow detection
The fifo tasking implementation can arrange to halt programs when any task overflows its call stack (see Task Call Stacks). It does this by placing a guard page, which cannot be referenced, at the end of each task stack. When a task tries to extend its stack onto a guard page, it fails with a segfault.
This feature is enabled in fifo tasking and cannot currently be turned off. There is a performance cost for it, which we expect to be small in most cases. We do not have a quantitative estimate for this cost, but it is a fixed overhead (a couple of system calls) added to the time needed to start each pthread. Since the pthreads in fifo tasking are long-lived and can host many tasks over their lifespan, on a per-task basis we don’t expect stack overflow detection to be expensive.
Controlling the Number of Threads
The number of threads per compute node used to implement a Chapel
program can be controlled by the CHPL_RT_NUM_THREADS_PER_LOCALE
environment variable. This may be set to either an explicit number
or one of the following symbolic strings:
- ‘MAX_PHYSICAL’:
number of physical CPUs (cores) on the node
- ‘MAX_LOGICAL’:
number of logical CPUs (hyperthreads) on the node
If CHPL_RT_NUM_THREADS_PER_LOCALE
is not set, the number of threads is
left up to the tasking layer. See the case-by-case discussions below
for more details.
The Chapel program will generate an error if the requested number of threads per locale is too large. For example, when running multi-locale programs, the GASNet communication layer typically places an upper bound of 127 or 255 on the number of threads per locale. There are ways to work around this limit on certain platforms; please contact us if you need to do so, or peruse the GASNet documentation.
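The accepted forms of CHPL_RT_NUM_THREADS_PER_LOCALE described above (an explicit number or one of the two symbolic strings) can be checked before launching a job. The sketch below is illustrative only: the helper name is hypothetical, and it does not check upper bounds such as the GASNet limit, which the Chapel runtime itself enforces.

```shell
# Return success if $1 is an explicit non-negative integer or one of
# the symbolic strings MAX_PHYSICAL / MAX_LOGICAL.
is_valid_thread_count() {
  case "$1" in
    MAX_PHYSICAL|MAX_LOGICAL) return 0 ;;
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    *) return 0 ;;
  esac
}

requested=MAX_PHYSICAL
if is_valid_thread_count "$requested"; then
  export CHPL_RT_NUM_THREADS_PER_LOCALE="$requested"
  echo "CHPL_RT_NUM_THREADS_PER_LOCALE=$CHPL_RT_NUM_THREADS_PER_LOCALE"
else
  echo "invalid thread count: $requested" >&2
fi
```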
CHPL_TASKS == fifo
The value of CHPL_RT_NUM_THREADS_PER_LOCALE indicates the maximum number of threads that the fifo tasking layer can create on each locale to execute tasks. These threads are created on a demand-driven basis, so a program with a small number of concurrent tasks may never create the specified number. If the value is zero, then the number of threads will be limited by system resources and other constraints (such as GASNet’s configuration-time limit).
The value of CHPL_RT_NUM_THREADS_PER_LOCALE can have a major impact on performance for fifo tasking. For programs with few inter-task dependences and high computational intensity, setting it roughly equal to the number of physical CPUs on each locale can lead to near-optimal performance. However, for programs with lots of fine-grained synchronization in which tasks frequently block on sync or single variables, CHPL_RT_NUM_THREADS_PER_LOCALE can often exceed the number of physical CPUs without an adverse effect on performance since blocked threads will not consume the CPU’s cycles.
Note that setting CHPL_RT_NUM_THREADS_PER_LOCALE too low can result in program deadlock for fifo tasking. For example, for programs written with an assumption that some minimum number of tasks are executing concurrently, setting CHPL_RT_NUM_THREADS_PER_LOCALE lower than this can result in deadlock if there are not enough threads to implement all of the required tasks.
When CHPL_RT_NUM_THREADS_PER_LOCALE is set, a warning like the following is issued:
warning: Setting number of threads in CHPL_TASKS=fifo can lead to deadlock
This warning can be suppressed by setting CHPL_RT_NUM_THREADS_PER_LOCALE_QUIET=yes.
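Put together, a fifo run with a thread cap and the warning suppressed might be configured as in this sketch. The cap of 16 and the program name are example values only; choose a cap at least as large as the number of tasks your program assumes are running concurrently.

```shell
# Hypothetical program built with CHPL_TASKS=fifo.
PROG=./myprog_fifo

# Allow the fifo layer to create at most 16 threads per locale.
export CHPL_RT_NUM_THREADS_PER_LOCALE=16

# Suppress the deadlock warning once the cap is known to be safe.
export CHPL_RT_NUM_THREADS_PER_LOCALE_QUIET=yes

if [ -x "$PROG" ]; then
  "$PROG"
else
  echo "would run: $PROG with at most $CHPL_RT_NUM_THREADS_PER_LOCALE threads per locale"
fi
```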
CHPL_TASKS == qthreads
In the Qthreads tasking layer,
CHPL_RT_NUM_THREADS_PER_LOCALE
specifies the number of system threads used to execute tasks. The default is to use a number of threads equal to the number of physical CPUs on the locale.
Task Call Stacks
Each task, including the main Chapel program, has an associated call stack. As documented in Executing Chapel Programs, the CHPL_RT_CALL_STACK_SIZE environment variable can be used to specify how big these call stacks will be during execution. See there for a full description of this environment variable and the values it can take.
When a task’s call chain becomes so deep that it needs more space than the size of its call stack, stack overflow occurs. Whether or not a program checks for stack overflow at execution time can be specified when it is compiled, via the --[no-]stack-checks compilation option. The compile-time default is --stack-checks, so stack overflow checks are enabled by default; --no-stack-checks can be given directly, and is also implied by --no-checks, which in turn is implied by --fast.
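The effect of these flags on a build can be sketched as follows. The source and output file names are hypothetical, and the commands are only executed if chpl is on the PATH and the source file exists; otherwise they are printed.

```shell
SRC=myprog.chpl   # hypothetical source file

# A checked build: stack overflow checks on (also the default).
CHECKED_BUILD="chpl --stack-checks -o myprog_checked $SRC"

# An optimized build: --fast implies --no-checks, which in turn
# implies --no-stack-checks.
FAST_BUILD="chpl --fast -o myprog_fast $SRC"

if command -v chpl >/dev/null 2>&1 && [ -f "$SRC" ]; then
  $CHECKED_BUILD
  $FAST_BUILD
else
  echo "$CHECKED_BUILD"
  echo "$FAST_BUILD"
fi
```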
Chapel does not yet have a consistent, implementation-independent way to deal with call stack overflow. Each tasking layer implementation handles stacks and stack overflow in its own way, as described below.
CHPL_TASKS == fifo
In fifo tasking, Chapel tasks use their host pthreads’ stacks when executing. If stack checks are enabled, these stacks are created with an additional memory page called a “guard page” beyond their end, that is marked so that it cannot be referenced. When stack overflow occurs, the task’s attempt to reference the guard page will cause the OS to react as it usually does to an invalid memory reference. On Linux, for example, it will kill the program with this message:
Segmentation fault
Unfortunately, many other things that cause improper memory references result in this same kind of program termination, so as a diagnostic it is ambiguous. However, it does at least prevent the program from continuing on in an erroneous state.
CHPL_TASKS == qthreads
Like fifo tasking (see above), qthreads tasking can place guard pages beyond the ends of task stacks. Stack overflow then results in the system’s usual response to referencing memory that cannot be reached. With qthreads tasking, the compiler --stack-checks setting specifies the default setting for execution-time stack overflow checking. Final control over stack overflow checks is provided by the QT_GUARD_PAGES environment variable. See the qthreads subsection of Task Implementation Layers for more information.
Quantifying Tasks on Locales
The locale type has a method available to query the number of tasks that are running on a given locale.
- runningTasks()
returns the number of tasks that have been created but have not yet finished. Note that this number can exceed the number of threads because tasking layers may be capable of switching among multiple Chapel tasks running on a single hosting thread.
In order to use this method, you have to specify the locale you wish to query, as in here.runningTasks(), where ‘here’ is the current locale.
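To try the query described above, the following sketch writes a small Chapel program to a scratch directory and, if the chpl compiler is on the PATH, compiles and runs it. The file names are hypothetical. The counts printed depend on the tasking layer and on scheduling, so no particular output is shown.

```shell
# Write a small Chapel program to a scratch directory.
WORKDIR=$(mktemp -d)
cat > "$WORKDIR/running.chpl" <<'EOF'
// Each of the four tasks reports how many tasks are running on its locale.
coforall i in 1..4 do
  writeln("task ", i, ": ", here.runningTasks(), " running tasks on locale ", here.id);
EOF

# Compile and run only if the Chapel compiler is available.
if command -v chpl >/dev/null 2>&1; then
  chpl -o "$WORKDIR/running" "$WORKDIR/running.chpl" && "$WORKDIR/running"
else
  echo "chpl not found; wrote $WORKDIR/running.chpl"
fi
```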
Future Tasking Directions
As Chapel’s task parallel implementation matures, we expect to have multiple task-to-thread scheduling policies, from literally creating and destroying new threads with each task (for programmers who want full control over a thread’s lifetime) to automated work stealing and load balancing at the other end of the spectrum (for programmers who would prefer not to manage threads or whose programs cannot trivially be load balanced manually). Our hope is to leverage existing open source threading and task management software and to collaborate with others working on it, so please contact us if you’d like to work with us in this area.