.. _readme-launcher:

================
Chapel Launchers
================

When compiling Chapel programs for multiple locales, a launcher binary
is typically created that will execute the appropriate command(s) to
get your program started. For example, the compiler will typically
generate two binaries (e.g., ``myprogram`` and ``myprogram_real``).
The first binary contains code to get your program up and running on
multiple locales, while the second contains your actual program code.

The goals of the launcher binary are:

#. to wrap details of job startup in a portable way so that new users
   can quickly get Chapel programs up and running on an unfamiliar
   platform.

#. to perform command-line parsing and error checking prior to waiting
   in a queue or firing off a parallel job, in order to save time and
   resources related to simple errors/typos on the command line.

#. to preserve Chapel's global-view programming model by permitting the
   user to run their program using a single binary (corresponding to
   the single logical task that executes ``main()``) without getting
   bogged down in questions of numbers of nodes, numbers of cores per
   node, numbers of program instances to start up, etc.

#. if necessary, to coordinate runtime functional activity, such as I/O.

Executing a Chapel program with the verbose (``-v``) flag will
typically print out the command(s) used to launch the program, along
with any environment variables the launcher set on the program's
behalf. It will also cause the program itself to print additional
information about how it configured itself, though most of this will
be of more interest to Chapel developers than to regular users.

Executing with the help (``-h``/``--help``) flag will typically print
out any launcher-specific options in addition to the normal help
message for the program itself.

You can also execute the Chapel launcher with the ``--dry-run`` flag.
This will not actually run or launch the user program, but will
instead simply print the same thing as ``-v``: the command(s) that
would have been used to launch the program, along with any environment
variables the launcher would have set on the program's behalf. Note
that ``--dry-run`` will also cause batch and other files created for
the system launcher to be left behind so you can inspect and/or reuse
them. Normally these are removed by the Chapel launcher when the
program finishes. An example of such a file is the ``sbatch`` file
created when a Slurm-based launcher is used and
``CHPL_LAUNCHER_USE_SBATCH`` is set.
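For example, a typical session using these flags might look like the
following sketch, assuming a multi-locale configuration and a
hypothetical program ``hello.chpl`` (the exact commands printed depend
on your launcher and system):

.. code-block:: bash

   # Compiling for multiple locales typically produces ./hello (the
   # launcher) and ./hello_real (the actual program)
   chpl -o hello hello.chpl

   # Print the launch command(s) without actually running the program
   ./hello -nl 2 --dry-run

   # Launch on two locales, printing the launch command(s) used
   ./hello -nl 2 -v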
Currently Supported Launchers
+++++++++++++++++++++++++++++

Currently supported launchers include:

======================= ==============================================================
Launcher Name           Description
======================= ==============================================================
amudprun                GASNet launcher for the UDP substrate
aprun                   Cray application launcher using aprun
gasnetrun_ibv           GASNet launcher for the InfiniBand substrate
gasnetrun_mpi           GASNet launcher for the MPI substrate
mpirun4ofi              provisional launcher for ``CHPL_COMM=ofi`` on non-Cray systems
lsf-gasnetrun_ibv       GASNet launcher for LSF (bsub) and the InfiniBand substrate
pals                    Cray application launcher for PALS on HPE Cray EX systems
pbs-aprun               Cray application launcher for PBS (qsub) + aprun
pbs-gasnetrun_ibv       GASNet launcher for PBS (qsub) and the InfiniBand substrate
slurm |-| gasnetrun_ibv GASNet launcher for Slurm and the InfiniBand substrate
slurm |-| gasnetrun_mpi GASNet launcher for Slurm and the MPI substrate
slurm |-| gasnetrun_ofi GASNet launcher for Slurm and the OFI substrate
slurm-srun              native Slurm launcher
smp                     GASNet launcher for the shared-memory substrate
none                    do not use a launcher
======================= ==============================================================

A specific launcher can be explicitly requested by setting the
``CHPL_LAUNCHER`` environment variable. For the specific case of the
``mpirun4ofi`` launcher, please see :ref:`readme-libfabric`.

If ``CHPL_LAUNCHER`` is left unset, a default is picked as follows:

* if ``CHPL_COMM`` is gasnet and ``CHPL_COMM_SUBSTRATE`` is udp,
  ``CHPL_LAUNCHER`` is set to amudprun

* otherwise, if ``CHPL_TARGET_PLATFORM`` is cray-xc or hpe-cray-ex:

  ================================== ===================================
  If                                 CHPL_LAUNCHER
  ================================== ===================================
  both aprun and srun in user's path none
  aprun in user's path               aprun
  srun in user's path                slurm-srun
  otherwise                          none
  ================================== ===================================

* otherwise, if ``CHPL_TARGET_PLATFORM`` is cray-cs and ``CHPL_COMM``
  is gasnet and salloc is in the user's path:

  ======================= ==============================================
  If                      CHPL_LAUNCHER
  ======================= ==============================================
  CHPL_COMM_SUBSTRATE=ibv slurm-gasnetrun_ibv
  CHPL_COMM_SUBSTRATE=mpi slurm-gasnetrun_mpi
  ======================= ==============================================

* otherwise, if ``CHPL_TARGET_PLATFORM`` is cray-cs and srun is in the
  user's path, ``CHPL_LAUNCHER`` is set to slurm-srun

* otherwise, if ``CHPL_COMM`` is gasnet:

  ======================= ==============================================
  If                      CHPL_LAUNCHER
  ======================= ==============================================
  CHPL_COMM_SUBSTRATE=ibv gasnetrun_ibv
  CHPL_COMM_SUBSTRATE=mpi gasnetrun_mpi
  CHPL_COMM_SUBSTRATE=smp smp
  otherwise               none
  ======================= ==============================================

* otherwise, ``CHPL_LAUNCHER`` is set to none

If the launcher binary does not work for your system (e.g., due to an
installation-specific configuration), you can often use the
``--dry-run`` flag to capture the commands that the launcher would
have executed on your behalf and customize them for your needs.
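For example, you can check which launcher your current configuration
selects before compiling, and override it explicitly. This is a sketch
assuming a Chapel source installation, where the ``printchplenv``
utility is available, and a hypothetical program ``myprogram.chpl``:

.. code-block:: bash

   # Report the launcher implied by the current environment settings
   $CHPL_HOME/util/printchplenv | grep CHPL_LAUNCHER

   # Explicitly request the native Slurm launcher instead, then
   # recompile so the new launcher takes effect
   export CHPL_LAUNCHER=slurm-srun
   chpl -o myprogram myprogram.chpl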
Forwarding Environment Variables
++++++++++++++++++++++++++++++++

Chapel launchers generally arrange for environment variables to be
forwarded to worker processes. However, this strategy is not always
reliable: the remote system may override some environment variables,
and some launchers might not correctly forward all of them.

.. _chpl-rt-masterip:

CHPL_RT_MASTERIP
****************

This environment variable specifies the IP address that worker nodes
should use to connect to the master node. By default, the node
creating the connection will pass the result of ``gethostname()`` on
to the nodes that need to connect to it, and those nodes will resolve
it to an IP address using ``gethostbyname()``. When
``CHPL_COMM == gasnet``, this will also be used to set the value of
``GASNET_MASTERIP``, which corresponds to the hostname of the master
node (see https://gasnet.lbl.gov/dist/udp-conduit/README ).

.. _chpl-rt-workerip:

CHPL_RT_WORKERIP
****************

This environment variable specifies the IP address that should be used
to communicate between worker nodes. By default, worker nodes will
communicate among themselves using the same interface used to connect
to the master node (see :ref:`chpl-rt-masterip`, above). When
``CHPL_COMM == gasnet``, this will also be used to set the value of
``GASNET_WORKERIP`` (see https://gasnet.lbl.gov/dist/udp-conduit/README ).

.. _using-slurm:

Using Slurm
+++++++++++

To use native Slurm, set:

.. code-block:: sh

   export CHPL_LAUNCHER=slurm-srun

On Cray systems, this will happen automatically if srun is found in
your path, but not when both srun and aprun are found in your path.

Native Slurm is the best option where it works, but at the time of
this writing, there are problems with it when combined with
``CHPL_COMM=gasnet`` and the UDP or InfiniBand conduits. So, for these
configurations please see:

* :ref:`readme-infiniband` for information about using Slurm with
  InfiniBand.

* :ref:`using-udp-slurm` for information about using Slurm with the
  UDP conduit.

Common Slurm Settings
*********************

* Optionally, you can specify a node access mode by setting the
  environment variable ``CHPL_LAUNCHER_NODE_ACCESS``. It defaults to
  ``exclusive`` access, but can be overridden to:

  * ``shared`` to give shared access to nodes
  * ``unset`` to use the system default and not specify a node access
    mode
  * ``exclusive`` to give exclusive access to nodes (this is the
    default)

  For example, to grant shared node access, set:

  .. code-block:: bash

     export CHPL_LAUNCHER_NODE_ACCESS=shared

* Optionally, you can specify a Slurm partition by setting the
  environment variable ``CHPL_LAUNCHER_PARTITION``. For example, to
  use the 'debug' partition, set:

  .. code-block:: bash

     export CHPL_LAUNCHER_PARTITION=debug

* Optionally, you can specify a Slurm nodelist by setting the
  environment variable ``CHPL_LAUNCHER_NODELIST``. For example, to use
  node nid00001, set:

  .. code-block:: bash

     export CHPL_LAUNCHER_NODELIST=nid00001

* Optionally, you can specify a Slurm constraint by setting the
  environment variable ``CHPL_LAUNCHER_CONSTRAINT``. For example, to
  use nodes with the 'cal' feature (as defined in the slurm.conf
  file), set:

  .. code-block:: bash

     export CHPL_LAUNCHER_CONSTRAINT=cal

* Optionally, you can specify a Slurm account by setting the
  environment variable ``CHPL_LAUNCHER_ACCOUNT``. For example, to use
  the account 'acct', set:

  .. code-block:: bash

     export CHPL_LAUNCHER_ACCOUNT=acct

* If the environment variable ``CHPL_LAUNCHER_USE_SBATCH`` is defined,
  then sbatch is used to submit the job to the queue system rather
  than running it interactively as usual. In this mode, the output
  will be written by default to a file called
  ``<executableName>.<jobID>.out``. The environment variable
  ``CHPL_LAUNCHER_SLURM_OUTPUT_FILENAME`` can be used to specify a
  different filename for the output.
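To illustrate how several of these settings combine, here is a minimal
sketch that submits a four-locale run of a hypothetical ``myprogram``
binary to the 'debug' partition via sbatch, directing its output to a
custom file:

.. code-block:: bash

   # Use the 'debug' partition and submit via sbatch rather than
   # running interactively
   export CHPL_LAUNCHER_PARTITION=debug
   export CHPL_LAUNCHER_USE_SBATCH=1

   # Write the job's output to a custom file instead of the default
   export CHPL_LAUNCHER_SLURM_OUTPUT_FILENAME=myprogram.out

   ./myprogram -nl 4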
.. _ssh-launchers-with-slurm:

Using any SSH-based launcher with Slurm
***************************************

It is possible to use any SSH-based launcher with Slurm, with some
additional effort. This strategy can come in handy if other launchers
are not working. However, launchers such as ``slurm-srun`` and
``slurm-gasnetrun_ibv`` offer a better experience.

First, let's see how to use an SSH-based launcher with an interactive
``salloc`` session. Here we will assume the UDP conduit, but any other
launcher supporting SSH can be configured analogously.

.. code-block:: bash

   # Compile a sample program
   chpl -o hello6-taskpar-dist examples/hello6-taskpar-dist.chpl

   # Reserve 2 nodes for an interactive run
   salloc -N 2

   # Then, within the salloc shell:

   # Specify that ssh should be used
   export GASNET_SPAWNFN=S

   # Specify the list of nodes to use; SSH_SERVERS can also be used
   export GASNET_SSH_SERVERS=`scontrol show hostnames | xargs echo`

   # Run the program on the 2 reserved nodes
   ./hello6-taskpar-dist -nl 2

This strategy can also be used within an *sbatch* script. Here is an
example script to save to the file ``job.bash``:

.. code-block:: bash

   #!/bin/bash

   #SBATCH -t 0:10:0
   #SBATCH --nodes=2
   #SBATCH --exclusive
   #SBATCH --partition=chapel
   #SBATCH --output=job.output

   export GASNET_SPAWNFN=S
   export GASNET_SSH_SERVERS=`scontrol show hostnames | xargs echo`

   ./hello6-taskpar-dist -nl 2

To run this job, use:

.. code-block:: bash

   sbatch job.bash

When it completes, the output will be available in ``job.output``, as
specified in ``job.bash``.

Changing the _real binary suffix
++++++++++++++++++++++++++++++++

In order to support profiling tools that produce new binaries for the
launcher to execute, the suffix of the real binary executed by the
launcher may be changed with the ``CHPL_LAUNCHER_SUFFIX`` environment
variable. If this variable is unset, the suffix defaults to ``_real``,
matching the compiler's output.

Bypassing the launcher
++++++++++++++++++++++

If the Chapel launcher capability fails you completely, set
``CHPL_LAUNCHER`` to none, recompile, and execute the resulting binary
according to the following rules, using tools and queueing mechanisms
appropriate for your system (a concrete sketch appears at the end of
this document):

* On most systems, the number of locales should be equal to the number
  of nodes on which you execute. That in turn should match the number
  of copies of the program that you are running.

* Some queueing systems require you to specify the number of cores to
  use per node. For best results, you will typically want to use all
  of them. All intra-node parallelism is typically implemented using
  Chapel's threading layer (e.g., pthreads), so extra copies of the
  binary are not required per core.

* In our experience, this technique does not work for InfiniBand
  configurations.

Additional launchers
++++++++++++++++++++

In addition to the supported launchers listed above, there are several
others that are not actively maintained but may still work:

============= ==========================================================
Launcher Name Description
============= ==========================================================
mpirun        launch using mpirun (no mpi comm currently)
============= ==========================================================

.. |-| unicode:: U+2011 .. non-breaking hyphen
   :trim:
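As mentioned in `Bypassing the launcher`_ above, here is a minimal
sketch of running without a launcher. It assumes a Slurm system, a
hypothetical program ``myprogram``, and a communication layer that can
bootstrap under ``srun``; the exact mechanism will vary by system:

.. code-block:: bash

   # Recompile with no launcher wrapper
   export CHPL_LAUNCHER=none
   chpl -o myprogram myprogram.chpl

   # Start one copy of the program per node; four nodes yield four
   # locales, matching the rules above
   srun --nodes=4 --ntasks=4 --ntasks-per-node=1 ./myprogram -nl 4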