Module: HDFS¶
Support for Hadoop Distributed Filesystem
This module implements support for the Hadoop Distributed Filesystem (HDFS).
Dependencies¶
The HDFS functionality in Chapel depends upon both Hadoop and Java being installed. Your HADOOP_INSTALL, JAVA_INSTALL, and CLASSPATH environment variables must be set as described below in Setting up Hadoop; without them, Chapel will not compile with HDFS support, even if the flags are set. In addition, the HDFS functionality depends upon the CHPL_AUXIO_INCLUDE and CHPL_AUXIO_LIBS environment variables being set properly. For more information on how to set them, see doc/technotes/README.auxIO in a Chapel release.
Enabling HDFS Support¶
Once you have ensured that Hadoop and Java are installed and have defined the five variables above, set the environment variable CHPL_AUX_FILESYS to ‘hdfs’ to enable HDFS support:
export CHPL_AUX_FILESYS=hdfs
Then, rebuild Chapel by executing ‘make’ from $CHPL_HOME.
make
Note
If HDFS support is not enabled (which is the default), all features described below will compile successfully but will result in an error at runtime such as: “No HDFS Support”.
Using HDFS Support in Chapel¶
Chapel provides three ways to open HDFS files.
Using an HDFS filesystem with open(url=”hdfs://...”)¶
// Open a file on HDFS connecting to the default HDFS instance
var f = open(mode=iomode.r, url="hdfs://host:port/path");
// Open up a reader and read from the file
var reader = f.reader();
// ...
reader.close();
f.close();
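A fuller sketch of this pattern, reading the file line by line; the host, port, and path here are hypothetical placeholders for your own HDFS instance:

```chapel
// Hypothetical namenode host/port and file path; substitute your own
var f = open(mode=iomode.r, url="hdfs://namenode:54310/user/johnDoe/data.txt");
var reader = f.reader();
var line: string;
// readline returns false at end-of-file; the returned line
// includes its trailing newline, so use write rather than writeln
while reader.readline(line) do
  write(line);
reader.close();
f.close();
```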
Explicitly Using Replicated HDFS Connections and Files¶
use HDFS;
// Connect to HDFS via the default username (or whichever you want)
//
var hdfs = hdfsChapelConnect("default", 0);
//
// Open a file per locale
//
var gfl = hdfs.hdfsOpen("/user/johnDoe/isThisAfile.txt", iomode.r);
...
//
// On any given locale, you can get the local file for the locale that
// the task is currently running on via:
//
var fl = gfl.getLocal();
// This file can be used as with a traditional file in Chapel, by
// creating reader channels on it.
// When you are done and want to close the files and disconnect from
// HDFS, use:
gfl.hdfsClose();
hdfs.hdfsChapelDisconnect();
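Putting these pieces together, a minimal sketch (reusing the file path from above, which is illustrative) that reads the replicated file from every locale:

```chapel
use HDFS;

var hdfs = hdfsChapelConnect("default", 0);
var gfl  = hdfs.hdfsOpen("/user/johnDoe/isThisAfile.txt", iomode.r);

coforall loc in Locales do on loc {
  // Get the file for the locale this task is running on
  var fl = gfl.getLocal();
  var reader = fl.reader();
  var line: string;
  while reader.readline(line) do
    write(here.id, ": ", line);
  reader.close();
}

gfl.hdfsClose();
hdfs.hdfsChapelDisconnect();
```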
Explicitly Using Local HDFS Connections and Files¶
The HDFS module also supports non-replicated values, local to a single locale. For example, to connect to HDFS and open a file only on locale 1:
on Locales[1] {
var hdfs = hdfs_chapel_connect("default", 0);
var fl = hdfs.hdfs_chapel_open("/user/johnDoe/myFile.txt", iomode.cw);
...
var read = fl.reader();
...
fl.close();
hdfs.hdfs_chapel_disconnect();
}
The only stipulation is that you cannot open a file in both read and write mode at the same time; i.e., iomode.r and iomode.cw are the only modes supported, due to HDFS limitations.
Setting up Hadoop¶
If you have a working installation of Hadoop already, you can skip this section, other than to set up your CLASSPATH environment variable. This section is written so that people without sudo permission can install and use HDFS. If you do have sudo permissions, you can usually install all of these via a package manager.
The general outline for these instructions is:
- Reflect your directory structure and version numbers in the sample .bashrc below, put it in your .bashrc (or .bash_profile, your choice), and source whichever file you put it into.
- Make sure you have a SERVER edition of the JDK installed and point JAVA_INSTALL to it (see the sample .bashrc below)
Install Hadoop
Download the latest version of Hadoop and unpack it
Now in the unpacked directory, open conf/hadoop-env.sh and edit:
set JAVA_INSTALL to be the part before bin/... when you do:
which java
set HADOOP_CLASSPATH=$HADOOP_HOME/""*:$HADOOP_HOME/lib/""*:
Now, in conf/hdfs-site.xml, set the dfs.replication field to the replication factor that you want (this controls how many times blocks of files in HDFS are replicated)
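For example, a minimal conf/hdfs-site.xml setting a replication factor of 3 (the value here is illustrative) would look like:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```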
Now set up passwordless ssh, if you haven’t yet:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Set up Hadoop
For the local host - See the Hadoop website for good documentation on how to do this.
For a cluster of hosts: if you want to run Hadoop over a cluster, there are good tutorials online, although it is usually as easy as making edits to the following files in $HADOOP_HOME/conf:
adding the name of the nodes to slaves
putting what you want to be the namenode in masters
putting the master node in core-site.xml and mapred-site.xml
running:
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker
After this, go to your datanode status page and you should see a new datanode.
A good online tutorial for this as well can be found here: http://hadoop.apache.org/docs/stable/cluster_setup.html
Start HDFS
Now all we need to do is format the namenode and start things up:
hadoop namenode -format
start-all.sh  # (This will start hdfs and the tasktracker/jobtracker)
In general, hadoop provides analogues of many standard bash commands, usually in the form:
hadoop dfs -<command> <regular args to that command>
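For example (the paths here are illustrative, and these commands assume a running HDFS instance):

```shell
# List the contents of a directory in HDFS
hadoop dfs -ls /user/johnDoe
# Copy a local file into HDFS
hadoop dfs -put myFile.txt /user/johnDoe/myFile.txt
# Print a file's contents
hadoop dfs -cat /user/johnDoe/myFile.txt
```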
At this point, you can compile and run Chapel programs using HDFS
You can check the status of Hadoop via its web interface. When running on a local host, browse to the namenode's HTTP port on that host; for cluster mode, you'll use the name of the master host in the URL and its port (see the web for details).
Shut things down:
stop-all.sh # (This will stop hdfs and mapreduce)
- Set up Chapel to run in distributed mode:
- You’ll need to set up your Chapel environment to target multiple locales in the standard way (see README.multilocale and the “Settings to run Chapel on multiple nodes” section of the .bashrc below).
- After this you should be able to run Chapel code with HDFS over a cluster of computers the same way as you normally would.
Here is a sample .bashrc for using Hadoop within Chapel:
#
# For Hadoop
#
export HADOOP_INSTALL=<Place where you have Hadoop installed>
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_VERSION=<Your Hadoop version number>
#
# Note that the following environment variables might contain more paths than
# those listed below if you have more than one IO system enabled. These are all
# that you will need in order to use HDFS (only)
#
export CHPL_AUXIO_INCLUDE="-I$JAVA_INSTALL/include -I$JAVA_INSTALL/include/linux -I$HADOOP_INSTALL/src/c++/libhdfs"
export CHPL_AUXIO_LIBS="-L$JAVA_INSTALL/jre/lib/amd64/server -L$HADOOP_INSTALL/c++/Linux-amd64-64/lib"
#
# So we can run things such as start-all.sh etc. from anywhere and
# don't need to be in $HADOOP_INSTALL
#
export PATH=$PATH:$HADOOP_INSTALL/bin
#
# Point to the JDK installation
#
export JAVA_INSTALL=<Place where you have the jdk installed>
#
# Add Hadoop directories to the Java class path
#
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/""*:$HADOOP_HOME/lib/""*:$HADOOP_HOME/conf/""*:$(hadoop classpath):
#
# So we don't have to "install" these things
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/c++/Linux-amd64-64/lib:$HADOOP_HOME/src/c++/libhdfs:$JAVA_INSTALL/jre/lib/amd64/server:$JAVA_INSTALL:$HADOOP_HOME/lib:$JAVA_INSTALL/jre/lib/amd64:$CLASSPATH
#
# Settings to run Chapel on multiple nodes
#
export GASNET_SPAWNFN=S
export SSH_SERVERS=<the names of the computers in your cluster>
export SSH_CMD=ssh
export SSH_OPTIONS=-x
export GASNET_ROUTE_OUTPUT=0
HDFS Support Types and Functions¶
- record hdfsChapelFile¶
Holds a file per locale
- record hdfsChapelFileSystem¶
Holds a connection to HDFS per locale
- proc hdfsChapelConnect(out error: syserr, path: c_string, port: int): c_void_ptr¶
Make a connection to HDFS for a single locale
- proc hdfsChapelConnect(path: string, port: int): hdfsChapelFileSystem¶
Connect to HDFS and create a filesystem ptr per locale
- proc hdfsChapelFileSystem.hdfsChapelDisconnect()¶
Disconnect from the configured HDFS filesystem on each locale
- proc hdfsChapelFileSystem.hdfsOpen(path: string, mode: iomode, hints: iohints = IOHINT_NONE, style: iostyle = defaultIOStyle()): hdfsChapelFile¶
Open a file on each locale
- proc hdfsChapelFile.hdfsClose()¶
Close a file opened with hdfsChapelFileSystem.hdfsOpen
- proc hdfsChapelFileSystem_local.hdfs_chapel_open(path: string, mode: iomode, hints: iohints = IOHINT_NONE, style: iostyle = defaultIOStyle()): file¶
Open a file on an HDFS filesystem for a single locale
- proc hdfs_chapel_connect(path: string, port: int): hdfsChapelFileSystem_local¶
Connect to an HDFS filesystem on a single locale