Auxiliary I/O Systems
This document describes Chapel support for Auxiliary I/O (AIO) systems. It also provides instructions on how to set Chapel up to support multiple Auxiliary I/O systems simultaneously.
Setting up HDFS
HDFS is the Hadoop Distributed Filesystem. This section demonstrates how to set up a Hadoop installation. If you already have access to an HDFS filesystem, you can skip ahead to Enabling HDFS Support.
HDFS functionality in Chapel depends on Hadoop being installed, and the associated environment variables must be set as described below. Without this, Chapel will not compile with HDFS support, even if the flags are set. HDFS functionality also depends on the CHPL_AUXIO_INCLUDE and CHPL_AUXIO_LIBS environment variables being set.
If you already have a working installation of Hadoop, you can skip this section, other than to set up your environment variables (see the sample .bashrc below). This section is written so that people without sudo permission can install and use HDFS. If you do have sudo permission, you can usually install all of these via a package manager.
The general outline of these instructions is:
First, update the sample .bashrc below to reflect your directory structure and version numbers, add it to your .bashrc (or another shell rc file of your choice), and source that file.
- Make sure you have a SERVER edition of the JDK installed and point JAVA_INSTALL to it (see the sample .bashrc below)
Download the latest version of Hadoop and unpack it
Now, in the unpacked directory, edit the Hadoop environment configuration so that JAVA_INSTALL is the part of the path before bin/ in the location of your java executable.
Now, in conf/hdfs-site.xml, set the field dfs.replication to the replication number that you want (this sets the replication of blocks of the files in HDFS).
Now set up passwordless ssh, if you haven't yet:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Set up Hadoop
For the local host - See the Hadoop website for good documentation on how to do this.
For a cluster of hosts: if you want to run Hadoop over a cluster, there are good tutorials online, although it is usually as easy as making edits to the following files in the Hadoop configuration directory:
adding the name of the nodes to
putting what you want to be the namenode in
putting the master node in
Then start the datanode and tasktracker daemons:
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker
After this go to your datanode site and you should see a new datanode.
A good online tutorial for this can also be found in the Hadoop Cluster Setup documentation.
Now all we need to do is format the namenode and start things up:
hadoop namenode -format
start-all.sh # (This will start hdfs and the tasktracker/jobtracker)
In general, hadoop has the same type of commands as bash, usually in the form:
hadoop dfs -<command> <regular args to that command>
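As an illustration of that command form, the sketch below wraps a few common file operations in a shell function. The function name hdfs_demo, the HADOOP override variable, and the default paths are hypothetical and chosen only for this example; the -mkdir, -put, -ls, and -cat subcommands mirror their bash counterparts. Setting HADOOP=echo gives a dry run that prints the commands instead of executing them.

```shell
# Hypothetical helper: exercises the "hadoop dfs -<command> <args>" form.
# Set HADOOP=echo for a dry run that only prints the commands.
# The body runs in a subshell, so its variables do not leak to the caller.
hdfs_demo() (
  hadoop="${HADOOP:-hadoop}"
  dir="${1:-/tmp/chapel-demo}"          # example HDFS directory (assumption)
  file="${2:-input.txt}"                # example local file (assumption)

  "$hadoop" dfs -mkdir "$dir"           # like: mkdir
  "$hadoop" dfs -put "$file" "$dir/"    # like: cp from local disk into HDFS
  "$hadoop" dfs -ls "$dir"              # like: ls
  "$hadoop" dfs -cat "$dir/$file"       # like: cat
)
```

Running HADOOP=echo hdfs_demo first is a cheap way to review the commands before they touch a real HDFS instance.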
At this point, you can compile and run Chapel programs using HDFS.
You can check the status of Hadoop over HTTP through its web interface, for example on the local host if you followed the local host setup above.
For cluster mode, use the name of the master host and its port in the URL (see the web for details).
Shut things down:
stop-all.sh # (This will stop hdfs and mapreduce)
- Set up Chapel to run in distributed mode:
- You'll need to set up your Chapel environment to target multiple locales in the standard way (see Multilocale Chapel Execution and the "Settings to run Chapel on multiple nodes" section of the Sample .bashrc below).
- After this you should be able to run Chapel code with HDFS over a cluster of computers the same way as you normally would.
Here is a sample .bashrc for using Hadoop within Chapel:
#
# For Hadoop
#
export HADOOP_INSTALL=<Place where you have Hadoop installed>
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_VERSION=<Your Hadoop version number>
#
# Point to the JDK installation (defined here because the variables
# below refer to it)
#
export JAVA_INSTALL=<Place where you have the jdk installed>
#
# Note that the following environment variables might contain more paths than
# those listed below if you have more than one IO system enabled. These are all
# that you will need in order to use HDFS (only)
#
export CHPL_AUXIO_INCLUDE="-I$JAVA_INSTALL/include -I$JAVA_INSTALL/include/linux -I$HADOOP_INSTALL/src/c++/libhdfs"
export CHPL_AUXIO_LIBS="-L$JAVA_INSTALL/jre/lib/amd64/server -L$HADOOP_INSTALL/c++/Linux-amd64-64/lib"
#
# So we can run things such as start-all.sh etc. from anywhere and
# don't need to be in $HADOOP_INSTALL
#
export PATH=$PATH:$HADOOP_INSTALL/bin
#
# Add Hadoop directories to the Java class path
#
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/""*:$HADOOP_HOME/lib/""*:$HADOOP_HOME/conf/""*:$(hadoop classpath):
#
# So we don't have to "install" these things
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/c++/Linux-amd64-64/lib:$HADOOP_HOME/src/c++/libhdfs:$JAVA_INSTALL/jre/lib/amd64/server:$JAVA_INSTALL:$HADOOP_HOME/lib:$JAVA_INSTALL/jre/lib/amd64:$CLASSPATH
#
# Settings to run Chapel on multiple nodes
#
export GASNET_SPAWNFN=S
export SSH_SERVERS=<the names of the computers in your cluster>
export SSH_CMD=ssh
export SSH_OPTIONS=-x
export GASNET_ROUTE_OUTPUT=0
Enabling HDFS Support
There are two ways to configure Chapel to work with HDFS: using the Java implementation with libhdfs; or using a C/C++ implementation with libhdfs3.
Set the CHPL_AUX_FILESYS environment variable accordingly:
# C/C++ implementation
export CHPL_AUX_FILESYS=hdfs3
# Java implementation. Also set the environment variables noted above.
export CHPL_AUX_FILESYS=hdfs
Then, rebuild Chapel by executing make from $CHPL_HOME.
If HDFS support is not enabled (which is the default), all HDFS features will compile successfully but will result in an error at runtime such as: "No HDFS Support".
Installing Curl Dependencies
The environment variables CHPL_AUXIO_INCLUDE and CHPL_AUXIO_LIBS must be set to point to the include and lib directories for libcurl, respectively. If libcurl is installed system-wide, you should not need to set these variables.
Enabling Curl Support
Once you have ensured that libcurl is installed and, if needed, have defined the two variables above, set the environment variable CHPL_AUX_FILESYS to 'curl' to enable Curl support. Then, rebuild Chapel by executing make from $CHPL_HOME.
If Curl support is not enabled (which is the default), all features described below will compile successfully but will result in an error at runtime, saying: "No Curl Support".
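Since these features fail only at run time when support is not enabled, it can help to confirm the setting before rebuilding. The sketch below is a hypothetical helper (check_aux_filesys is not part of Chapel) that recognizes only the values named in this document and treats an unset variable as "support not enabled".

```shell
# Hypothetical helper, not part of Chapel: reports how CHPL_AUX_FILESYS is set,
# recognizing only the values named in this document (hdfs, hdfs3, curl).
check_aux_filesys() {
  case "${CHPL_AUX_FILESYS:-}" in
    hdfs|hdfs3|curl) echo "enabled: $CHPL_AUX_FILESYS" ;;
    "")              echo "not enabled (default)" ;;
    *)               echo "unrecognized: $CHPL_AUX_FILESYS"; return 1 ;;
  esac
}
```

Other values may be valid for your Chapel version; extend the pattern as needed.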
The AIO system depends upon three environment variables: CHPL_AUX_FILESYS, CHPL_AUXIO_INCLUDE, and CHPL_AUXIO_LIBS.
In the following sections, we explain what they should be set to and give a general idea of what they do.
CHPL_AUXIO_INCLUDE & CHPL_AUXIO_LIBS
These paths are for the extra libraries that will be linked in with the runtime when it is compiled. For instance, if you installed libcurl with its headers in ~/include and its libraries in ~/lib, you would set them to:
export CHPL_AUXIO_LIBS="-L$HOME/lib"
export CHPL_AUXIO_INCLUDE="-I$HOME/include"
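When a library lives under a single installation prefix, both variables can be derived from it. The following is a minimal sketch, assuming the headers are in <prefix>/include and the libraries in <prefix>/lib; the helper name auxio_flags_from_prefix is hypothetical. It uses $HOME rather than ~ because the shell does not expand a tilde inside double quotes or directly after -L or -I.

```shell
# Hypothetical helper: derive the AIO flag variables from an install prefix.
# Assumes headers in <prefix>/include and libraries in <prefix>/lib.
auxio_flags_from_prefix() {
  prefix="${1:-$HOME}"
  export CHPL_AUXIO_INCLUDE="-I$prefix/include"   # header search path
  export CHPL_AUXIO_LIBS="-L$prefix/lib"          # library search path
}

auxio_flags_from_prefix "$HOME"
echo "$CHPL_AUXIO_INCLUDE $CHPL_AUXIO_LIBS"
```

Note that -I always goes with the include directory and -L with the library directory.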
In general, you want it so that if you had a .c file that used the libraries that you wish to compile Chapel with, all you would need to do to compile this file would be:
cc $CHPL_AUXIO_LIBS $CHPL_AUXIO_INCLUDE <any libraries> <the .c file>
where <any libraries> are the linker flags for those libraries (for example, -lcurl for libcurl).
It is not necessary to pass these library flags or library/include paths to Chapel compiler invocations (chpl), as the values in CHPL_AUXIO_LIBS and CHPL_AUXIO_INCLUDE will be used there as well as in building the runtime.
Assuming that you have correctly defined CHPL_AUXIO_INCLUDE and CHPL_AUXIO_LIBS as detailed above, and have the correct libraries installed: if you only have one AIO system that you wish to use, you may simply set CHPL_AUX_FILESYS=<system>. For example, if we only wanted Apache Hadoop HDFS support, we would set:
export CHPL_AUX_FILESYS=hdfs
Parallel and Distributed I/O Features
Two functions are supported for parallel and distributed file systems (they also work on "standard" file systems).
file.getchunk(start:int(64), end:int(64)):(int(64), int(64))
This returns the first logical chunk of the file that lies inside the section from start to end. If no chunk can be found inside this region, (0,0) is returned. If no arguments are provided, the start and end of the first logical chunk of the file are returned.
- On Lustre, this returns the first stripe for the file that is inside this region.
- On HDFS, this returns the first block for the file that is inside this region.
- On local file systems, it returns the first optimal transfer block (from fstatfs) inside this section of the file.
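To make the chunk semantics concrete, here is a rough sketch of the arithmetic for a file system with fixed-size blocks. It is not Chapel's implementation: the function first_chunk, its argument order, and the assumptions that chunks are aligned to multiples of the block size and must lie entirely within the region are all illustrative.

```shell
# Illustrative only, not Chapel's implementation.
# Prints "lo hi" for the first block-aligned chunk lying entirely within
# [start, end), or "0 0" if no such chunk exists.
first_chunk() {
  start=$1; end=$2; bs=$3
  lo=$(( (start + bs - 1) / bs * bs ))   # round start up to a block boundary
  hi=$(( lo + bs ))                      # one full block past that boundary
  if [ "$hi" -le "$end" ]; then
    echo "$lo $hi"
  else
    echo "0 0"
  fi
}

first_chunk 100 10000 4096    # prints "4096 8192"
first_chunk 100 5000 4096     # prints "0 0"
```

In this picture, an HDFS block size or a Lustre stripe size would play the role of the block size.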
The second function returns the best locales for a given chunk of the file. If no individual locale or set of locales is best (i.e., there is no data affinity that we can exploit), all locales are returned.
- On Lustre, no locales are best, so we return all locales
- On HDFS, we return the block owners for that specific block
- On local file systems, we return all locales, since no individual locale is best.