.. _readme-tasks:

============
Chapel Tasks
============

.. contents::

--------
Overview
--------

Chapel programs create new tasks via the begin, cobegin, and coforall
statements.  Tasks are computations that can conceptually execute
concurrently, though they may or may not do so in practice.

An implementation of Chapel must include at least one tasking layer.  A
tasking layer will in turn implement *threads*, which are a mechanism
for executing work in parallel.  All tasking layers support
configuration constants to control system resources such as the number
of threads that are available to execute tasks and the amount of call
stack space reserved for each task.  Generally speaking, the Chapel
programmer can make no assumptions about the scheduling of threads or
the mapping of tasks to threads other than those semantics defined by
the language specification.

This document describes the currently-supported tasking options in more
detail.  The rest of this document includes:

* an overview of the different tasking options
* a detailed description of each tasking option
* a discussion of the number of threads used by each tasking option
* a discussion of call stack sizes and overflow handling
* a list of tasking-related methods on the locale type
* a brief description of future directions for the tasking layer


--------------------------
Task Implementation Layers
--------------------------

This release contains two distinct tasking layers for Chapel tasks.
The user can select between these options by setting the ``CHPL_TASKS``
environment variable to one of the following values:

  :qthreads: best performance; default for most targets
  :fifo: most portable, but heavyweight; default for NetBSD and Cygwin

Each tasking layer is described in more detail below.


CHPL_TASKS == qthreads
----------------------

Chapel's default tasking layer implementation for most targets is based
on the Qthreads user-level threading package from Sandia National Labs.
This provides a lightweight implementation of Chapel tasking as well as
an optimized implementation of sync variables.

To use qthreads tasking, please take the following steps:

1) Ensure that the environment variable ``CHPL_HOME`` points to the
   top-level Chapel directory.

2) Set up your environment to use Qthreads: ensure ``CHPL_TASKS`` is
   not set (if qthreads is the default)

   -- or --

   .. code-block:: sh

      export CHPL_TASKS=qthreads

3) Follow the instructions in :ref:`readme-chplenv` to set up, compile,
   and run your Chapel programs.

Please report any apparent bugs in Qthreads tasking to the Chapel team.


Stack overflow detection
========================

The qthreads tasking implementation can arrange to halt programs when
any task overflows its call stack (see `Task Call Stacks`_).  It does
this by placing a guard page, which cannot be referenced, at the end of
each task stack.  When a task tries to extend its stack onto a guard
page, it fails with a segfault.

Normally guard pages for stack overflow detection are configured and
enabled.  There is a performance cost for this, however.  We do not
have a quantitative estimate for this cost, but it is a fixed overhead
(a couple of system calls) added to the time needed to run every task,
so qualitatively speaking it will have a greater effect on programs
that create more or shorter-lived tasks than on programs that create
fewer or longer-lived ones.

As described in `Task Call Stacks`_, the execution-time default for
stack overflow checking can be set by using the ``--[no-]stack-checks``
compiler option, as illustrated below.
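
As a minimal sketch (the program name ``myTaskProg.chpl`` is just a
placeholder, not part of this documentation), the two compile-time
settings look like this:

.. code-block:: sh

   # Compile with stack overflow checking enabled (the compile-time default).
   chpl --stack-checks myTaskProg.chpl -o myTaskProg

   # Compile with stack overflow checking disabled; --no-checks and
   # --fast also imply --no-stack-checks.
   chpl --no-stack-checks myTaskProg.chpl -o myTaskProg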

Whatever that default is, at execution time stack overflow detection
can be turned off by setting the environment variable
``QT_GUARD_PAGES`` to any of the values "0", "no", or "false", or
turned on by setting it to any of "1", "yes", or "true".  When it is
off, the execution overhead is negligible (just a couple of scalar
tests).  Developers who wish to remove even this small cost can disable
guard pages by building qthreads with guard pages entirely configured
out, as follows:

.. code-block:: sh

   cd $CHPL_HOME/third-party/qthread
   make CHPL_QTHREAD_NO_GUARD_PAGES=yes ... clean all

As noted, running without guard pages can improve performance and thus
may be desirable for production work.  However, if this is done, test
runs at similar scale with guard pages turned on should be done
beforehand if possible to check for stack overflow, because undetected
stack overflows can cause subtle and intermittent errors in execution.


Environment variables
=====================

Qthreads provides a number of environment variables that can be used to
configure its behavior at execution time.  An introduction to these can
be found in the ENVIRONMENT section of the qthread_init man page.
(Note that although this man page documents variables named
``QTHREAD_*``, each variable is actually present in both ``QT_*`` and
``QTHREAD_*`` forms, with the former superseding the latter.)  The
qthreads man pages are available by means of the ``man -M`` option, for
example:

.. code-block:: sh

   man -M $CHPL_HOME/third-party/qthread/qthread-src/man qthread_init

Note that in some cases there are Chapel environment variables that
override their Qthreads counterparts.
``CHPL_RT_NUM_THREADS_PER_LOCALE`` overrides ``QT_HWPAR``, for example.
Whenever a Chapel variable overrides a Qthreads variable, you should
use the Chapel one.


Worker affinity and number
==========================

Simplistically, there are two kinds of threads in Qthreads: shepherds,
which manage work distribution, and workers, which host qthreads
(Chapel tasks, for our purposes).  The execution-time environment
variable ``QT_WORKER_UNIT`` controls how worker threads are distributed
on hardware processors.  The default is "core", which distributes
workers across CPU cores (physical processors).  An alternative is
"pu", which distributes workers across processing units; these are
instances of the processor architecture, or hardware threads if the
cores have those.  Note that "pu" will be selected automatically if
``CHPL_RT_NUM_THREADS_PER_LOCALE`` is set to anything larger than the
number of cores, so it usually isn't necessary to set
``QT_WORKER_UNIT``.


.. _overloading-with-qthreads:

Overloading system nodes
========================

By default, the qthreads tasking implementation is set up to assume
that its process is not competing with anything else for system
resources (CPUs and memory) on its system node.  In this mode, qthreads
optimizes its internal behavior to favor performance over load
balancing.  This works out well for Chapel programs, because normally
Chapel runs with one process (locale) per system node.  However, with
``CHPL_COMM=gasnet`` or ``CHPL_COMM=ofi`` one can run multiple Chapel
locales on a single system node, say for doing multilocale functional
correctness testing with limited system resources.  (See
:ref:`readme-multilocale` for more details.)  When this is done,
qthreads' optimization for performance can instead greatly reduce
performance, due to resource starvation among the multiple Chapel
processes.  If you need qthreads to share system resources more
cooperatively with other processes, set ``CHPL_RT_OVERSUBSCRIBED=yes``
at execution time (see :ref:`oversubscribed-execution`), as in the
sketch below.
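
As an illustration of the scenario just described, here is a minimal
sketch of an oversubscribed run.  It assumes the program has already
been compiled for multilocale execution as ``./myprog`` (a placeholder
name) and that four locales are to share one system node:

.. code-block:: sh

   # Ask qthreads to share CPUs and memory cooperatively with the other
   # locales (processes) running on this node.
   export CHPL_RT_OVERSUBSCRIBED=yes

   # Run four locales on the single node for correctness testing.
   # Depending on the communication layer and conduit in use, additional
   # spawner settings may also be needed.
   ./myprog -nl 4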

Hwloc
=====

When ``CHPL_TASKS=qthreads``, the default for ``CHPL_HWLOC`` becomes
"bundled", and the hwloc third-party package will be built.  Qthreads
depends on this package to provide it with a description of the locale
hardware, to support locality and affinity operations.


Further information
===================

For more information on Qthreads, see $CHPL_HOME/third-party/README.


CHPL_TASKS == fifo
------------------

FIFO tasking over POSIX threads (pthreads) works on all platforms and
is the default for Cygwin and NetBSD.  It is attractive in its
portability, though on most platforms it will tend to be heavier weight
than Chapel strictly requires.  FIFO tasking is also used when Chapel
is configured in 'Quick Start' mode (see :ref:`chapelhome-quickstart`).

To use FIFO tasking, please take the following steps:

1) Ensure that the environment variable ``CHPL_HOME`` points to the
   top-level Chapel directory.

2) Set up your environment to use FIFO tasking:

   .. code-block:: sh

      export CHPL_TASKS=fifo

3) Follow the instructions in :ref:`readme-chplenv` to set up, compile,
   and run your Chapel programs.

In the FIFO tasking implementation, Chapel tasks are mapped to threads
such that each task is executed by a single thread and is run to
completion before giving up that thread.  As a result, a program can
have no more tasks active (that is, created and started) at any given
time than it has threads on which to run those tasks.  It can create
more tasks than threads, but no more tasks will be running at any time
than there are threads.  Excess tasks are placed in a pool, where they
will be picked up and started by threads as those threads complete
their current tasks.

The threading implementation uses POSIX threads (pthreads) to run
Chapel tasks.  Because pthreads are relatively expensive to create, the
tasking layer does not destroy them when there are no tasks for them to
execute.  Instead they stay around and continue to check the task pool
for tasks to execute.  Setting the number of pthreads is described in
`Controlling the Number of Threads`_.


Stack overflow detection
========================

The fifo tasking implementation can arrange to halt programs when any
task overflows its call stack (see `Task Call Stacks`_).  It does this
by placing a guard page, which cannot be referenced, at the end of each
task stack.  When a task tries to extend its stack onto a guard page,
it fails with a segfault.

This feature is enabled in fifo tasking and cannot currently be turned
off.  There is a performance cost for it, which we expect to be small
in most cases.  We do not have a quantitative estimate for this cost,
but it is a fixed overhead (a couple of system calls) added to the time
needed to start each pthread.  Since the pthreads in fifo tasking are
long-lived and can host many tasks over their lifespan, on a per-task
basis we do not expect stack overflow detection to be expensive.


---------------------------------
Controlling the Number of Threads
---------------------------------

The number of threads per compute node used to implement a Chapel
program can be controlled by the ``CHPL_RT_NUM_THREADS_PER_LOCALE``
environment variable.  This may be set to either an explicit number or
one of the following symbolic strings:

  :'MAX_PHYSICAL': number of physical CPUs (cores) on the node
  :'MAX_LOGICAL': number of logical CPUs (hyperthreads) on the node
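
For example, under the assumption that a program has been compiled as
``./myprog`` (a placeholder name), the thread count could be requested
in any of the following ways:

.. code-block:: sh

   # Request an explicit number of threads per locale.
   export CHPL_RT_NUM_THREADS_PER_LOCALE=16
   ./myprog

   # Or request one thread per physical core ...
   export CHPL_RT_NUM_THREADS_PER_LOCALE=MAX_PHYSICAL
   ./myprog

   # ... or one thread per logical CPU (hyperthread).
   export CHPL_RT_NUM_THREADS_PER_LOCALE=MAX_LOGICAL
   ./myprog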

If ``CHPL_RT_NUM_THREADS_PER_LOCALE`` is not set, the number of threads
is left up to the tasking layer.  See the case-by-case discussions
below for more details.

The Chapel program will generate an error if the requested number of
threads per locale is too large.  For example, when running multilocale
programs, the GASNet communication layer typically places an upper
bound of 127 or 255 on the number of threads per locale.  (There are
ways to work around this limit on certain platforms -- please contact
us if you need to do so, or peruse the GASNet documentation.)


CHPL_TASKS == fifo
------------------

The value of ``CHPL_RT_NUM_THREADS_PER_LOCALE`` indicates the maximum
number of threads that the fifo tasking layer can create on each locale
to execute tasks.  These threads are created on a demand-driven basis,
so a program with a small number of concurrent tasks may never create
the specified number.  If the value is zero, then the number of threads
will be limited by system resources and other constraints (such as
GASNet's configuration-time limit).

The value of ``CHPL_RT_NUM_THREADS_PER_LOCALE`` can have a major impact
on performance for fifo tasking.  For programs with few inter-task
dependences and high computational intensity, setting it roughly equal
to the number of physical CPUs on each locale can lead to near-optimal
performance.  However, for programs with lots of fine-grained
synchronization, in which tasks frequently block on sync or single
variables, ``CHPL_RT_NUM_THREADS_PER_LOCALE`` can often exceed the
number of physical CPUs without an adverse effect on performance, since
blocked threads do not consume CPU cycles.

Note that setting ``CHPL_RT_NUM_THREADS_PER_LOCALE`` too low can result
in program deadlock for fifo tasking.  For example, a program written
with the assumption that some minimum number of tasks are executing
concurrently can deadlock if there are not enough threads to run all of
the required tasks.  When ``CHPL_RT_NUM_THREADS_PER_LOCALE`` is set, a
warning like the following is issued:

.. code-block:: text

   warning: Setting number of threads in CHPL_TASKS=fifo can lead to deadlock

This warning can be suppressed by setting
``CHPL_RT_NUM_THREADS_PER_LOCALE_QUIET=yes``.


CHPL_TASKS == qthreads
----------------------

In the Qthreads tasking layer, ``CHPL_RT_NUM_THREADS_PER_LOCALE``
specifies the number of system threads used to execute tasks.  The
default is to use a number of threads equal to the number of physical
CPUs on the locale.


----------------
Task Call Stacks
----------------

Each task, including the main Chapel program, has an associated call
stack.  As documented in :ref:`readme-executing`, the
``CHPL_RT_CALL_STACK_SIZE`` environment variable can be used to specify
how big these call stacks will be during execution.  See there for a
full description of this environment variable and the values it can
take.

When a task's call chain becomes so deep that it needs more space than
the size of its call stack, stack overflow occurs.  Whether or not a
program checks for stack overflow at execution time is specified when
it is compiled, via the ``--[no-]stack-checks`` compilation option.
The compile-time default is ``--stack-checks``, so stack overflow
checks are enabled by default; ``--no-stack-checks`` can be given
directly, and is also implied by ``--no-checks``, which in turn is
implied by ``--fast``.
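
As a brief, hypothetical illustration of the execution-time control
described above, the call stack size could be raised for a single run
as follows; ``./myprog`` is again a placeholder, and
:ref:`readme-executing` documents the full set of value forms this
variable accepts:

.. code-block:: sh

   # Give each task a 16 MiB call stack, expressed here as a plain byte
   # count; see readme-executing for other accepted value forms.
   export CHPL_RT_CALL_STACK_SIZE=16777216
   ./myprog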

Chapel does not yet have a consistent, implementation-independent way
to deal with call stack overflow.  Each tasking layer implementation
handles stacks and stack overflow in its own way, as described below.


CHPL_TASKS == fifo
------------------

In fifo tasking, Chapel tasks use their host pthreads' stacks when
executing.  If stack checks are enabled, these stacks are created with
an additional memory page, called a "guard page", beyond their end that
is marked so that it cannot be referenced.  When stack overflow occurs,
the task's attempt to reference the guard page causes the OS to react
as it usually does to bad memory references.  On Linux, for example, it
will kill the program with this message:

.. code-block:: text

   Segmentation fault

Unfortunately, many other things that cause improper memory references
result in this same kind of program termination, so as a diagnostic it
is ambiguous.  However, it does at least prevent the program from
continuing on in an erroneous state.


CHPL_TASKS == qthreads
----------------------

Like fifo tasking (see above), qthreads tasking can place guard pages
beyond the ends of task stacks.  Stack overflow then results in the
system's usual response to referencing memory that cannot be reached.
With qthreads tasking, the compiler's ``--stack-checks`` setting
specifies the default setting for execution-time stack overflow
checking.  Final control over stack overflow checks is provided by the
``QT_GUARD_PAGES`` environment variable.  See the qthreads subsection
of `Task Implementation Layers`_ for more information.


----------------------------
Quantifying Tasks on Locales
----------------------------

The locale type has a method available to query the number of tasks
that are running on a given locale.  ``runningTasks()`` returns the
number of tasks that have been created but have not yet finished.  Note
that this number can exceed the number of threads, because tasking
layers may be capable of switching among multiple Chapel tasks running
on a single hosting thread.  To use this method, you must specify the
locale you wish to query, as in ``here.runningTasks()``, where ``here``
is the current locale.


-------------------------
Future Tasking Directions
-------------------------

As Chapel's task parallel implementation matures, we expect to support
multiple task-to-thread scheduling policies, ranging from literally
creating and destroying a new thread with each task (for programmers
who want full control over a thread's lifetime) to automated work
stealing and load balancing at the other end of the spectrum (for
programmers who would prefer not to manage threads or whose programs
cannot trivially be load balanced manually).  Our hope is to leverage
existing open-source threading and task management software and to
collaborate with others in these areas, so please contact us if you'd
like to work with us.