Module: HDFS

Support for Hadoop Distributed Filesystem

This module implements support for the Hadoop Distributed Filesystem (HDFS).

Dependencies

The HDFS functionality in Chapel depends on both Hadoop and Java being installed. Your HADOOP_INSTALL, JAVA_INSTALL, and CLASSPATH environment variables must be set as described below in Setting up Hadoop; without them, Chapel will not compile with HDFS support, even if the flags are set. In addition, the HDFS functionality depends on the CHPL_AUXIO_INCLUDE and CHPL_AUXIO_LIBS environment variables being set properly. For more information on how to set these, see doc/technotes/README.auxIO in a Chapel release.

Enabling HDFS Support

Once you have ensured that Hadoop and Java are installed and that the five variables above are defined, set the environment variable CHPL_AUX_FILESYS to ‘hdfs’ to enable HDFS support:

export CHPL_AUX_FILESYS=hdfs

Then, rebuild Chapel by executing ‘make’ from $CHPL_HOME.

make

Note

If HDFS support is not enabled (which is the default), all features described below will compile successfully but will result in an error at runtime such as: “No HDFS Support”.

Using HDFS Support in Chapel

Chapel provides three ways to open HDFS files.

Using an HDFS filesystem with open(url="hdfs://...")

// Open a file on HDFS connecting to the default HDFS instance
var f = open(mode=iomode.r, url="hdfs://host:port/path");

// Open up a reader and read from the file
var reader = f.reader();

// ...

reader.close();

f.close();
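
For instance, the elided reading step above might look like the following sketch, which reads whitespace-separated words from the file until the channel runs out of data (the host, port, and path are placeholders you would replace):

use IO;

// Open a file on HDFS by URL (placeholder host/port/path)
var f = open(mode=iomode.r, url="hdfs://host:port/path");

// Read words one at a time until the channel is exhausted
var reader = f.reader();
var word: string;
while reader.read(word) do
  writeln(word);

reader.close();
f.close();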

Explicitly Using Replicated HDFS Connections and Files

use HDFS;

// Connect to HDFS via the default username (or whichever you want)
//
var hdfs = hdfsChapelConnect("default", 0);

//
// Create a file per locale
//
var gfl  = hdfs.hdfsOpen("/user/johnDoe/isThisAfile.txt", iomode.r);

// ...
//
// On any given locale, you can get the local file for the locale that
// the task is currently running on via:
//
var fl = gfl.getLocal();

// This file can be used as with a traditional file in Chapel, by
// creating reader channels on it.

// When you are done and want to close the files and disconnect from
// HDFS, use:

gfl.hdfsClose();
hdfs.hdfsChapelDisconnect();
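
Putting these pieces together, the following sketch connects on every locale, has each locale fetch its locale-local file via getLocal(), and reads a single word from it (the path comes from the example above; the use of read is an illustrative assumption):

use HDFS;

// Connect to HDFS and open a replicated file handle
var hdfs = hdfsChapelConnect("default", 0);
var gfl  = hdfs.hdfsOpen("/user/johnDoe/isThisAfile.txt", iomode.r);

// On each locale, fetch that locale's local file and read from it
coforall loc in Locales do on loc {
  var fl = gfl.getLocal();
  var reader = fl.reader();
  var word: string;
  if reader.read(word) then
    writeln("locale ", here.id, " read: ", word);
  reader.close();
}

// Close the per-locale files and disconnect from HDFS
gfl.hdfsClose();
hdfs.hdfsChapelDisconnect();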

Explicitly Using Local HDFS Connections and Files

The HDFS module also supports non-replicated values across locales. So if you only wanted to connect to HDFS and open a file on locale 1, you could do:

on Locales[1] {
  var hdfs = hdfs_chapel_connect("default", 0);
  var fl = hdfs.hdfs_chapel_open("/user/johnDoe/myFile.txt", iomode.cw);
  ...
  var writer = fl.writer();
  ...
  fl.close();
  hdfs.hdfs_chapel_disconnect();
}

The only stipulation is that you cannot open a file in both read and write mode at the same time (i.e., iomode.r and iomode.cw are the only modes that are supported, due to HDFS limitations).
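
For example, to both write and read a file on a single locale you would open it twice, once per mode, as in the following sketch (the path is taken from the example above; the data written and the use of read are illustrative assumptions):

use HDFS;

on Locales[1] {
  var hdfs = hdfs_chapel_connect("default", 0);

  // Write phase: open in iomode.cw and write through a writer channel
  var outFile = hdfs.hdfs_chapel_open("/user/johnDoe/myFile.txt", iomode.cw);
  var w = outFile.writer();
  w.writeln("hello from locale ", here.id);
  w.close();
  outFile.close();

  // Read phase: reopen the same path in iomode.r and read it back
  var inFile = hdfs.hdfs_chapel_open("/user/johnDoe/myFile.txt", iomode.r);
  var r = inFile.reader();
  var word: string;
  if r.read(word) then
    writeln(word);
  r.close();
  inFile.close();

  hdfs.hdfs_chapel_disconnect();
}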

Setting up Hadoop

If you have a working installation of Hadoop already, you can skip this section, other than to set up your CLASSPATH environment variable. This section is written so that people without sudo permission can install and use HDFS. If you do have sudo permissions, you can usually install all of these via a package manager.

The general outline for these instructions is:

  1. Install a JDK and point to it, so that Chapel can compile against libhdfs (Step 1)
  2. Install Hadoop (Step 2)
  3. Set up Hadoop on (a) the local host or (b) a cluster of hosts (Step 3)
  4. Start up HDFS (Step 4)
  5. Stop HDFS when you’re done (Step 5)
  6. Set up Chapel to run in distributed mode (Step 6)

First, adapt the sample .bashrc below to reflect your directory structure, version numbers, etc., add it to your .bashrc (or .bash_profile, your choice), and source whichever file you put it in.

  1. Make sure you have a SERVER edition of the JDK installed and point JAVA_INSTALL to it (see the sample .bashrc below)
  2. Install Hadoop

    • Download the latest version of Hadoop and unpack it

    • Now in the unpacked directory, open conf/hadoop-env.sh and edit:

      • set JAVA_INSTALL to the part of the path before bin/... in the output of:

        which java
        
      • set HADOOP_CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:

    • Now, in conf/hdfs-site.xml, set the field dfs.replication to the replication factor that you want (this sets how many times blocks of files in HDFS are replicated)

    • Now set up passwordless ssh, if you haven’t yet:

      ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
      cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
      
  3. Set up Hadoop

    a. For the local host: see the Hadoop website for good documentation on how to do this.

    b. For a cluster of hosts: if you want to run Hadoop over a cluster, there are good tutorials online, although it is usually as easy as editing the following files in $HADOOP_HOME/conf:

      • adding the names of the nodes to slaves

      • putting what you want to be the namenode in masters

      • putting the master node in core-site.xml and mapred-site.xml

      • running:

        hadoop-daemon.sh start datanode
        hadoop-daemon.sh start tasktracker
        

      After this go to your datanode site and you should see a new datanode.

      A good online tutorial for this can also be found here: http://hadoop.apache.org/docs/stable/cluster_setup.html

  4. Start HDFS

    • Now all we need to do is format the namenode and start things up:

      hadoop namenode -format
      start-all.sh  # (This will start hdfs and the tasktracker/jobtracker)
      
    • In general, hadoop provides file commands analogous to the usual bash commands, usually in the form:

      hadoop dfs -<command> <regular args to that command>
      
    • At this point, you can compile and run Chapel programs using HDFS

    • You can check the status of Hadoop over HTTP; for a local host (e.g., for 3a above), use the namenode's web interface on localhost.

      For cluster mode (3b), you'll use the name of the master host in the URL and its port (see the web for details).

  5. Shut things down:

    stop-all.sh   # (This will stop hdfs and mapreduce)
    
  6. Set up Chapel to run in distributed mode:
    • You’ll need to set up your Chapel environment to target multiple locales in the standard way (see README.multilocale and the “Settings to run Chapel on multiple nodes” section of the .bashrc below).
    • After this you should be able to run Chapel code with HDFS over a cluster of computers the same way as you normally would.

Here is a sample .bashrc for using Hadoop within Chapel:

#
# For Hadoop
#
export HADOOP_INSTALL=<Place where you have Hadoop installed>
export HADOOP_HOME=$HADOOP_INSTALL
export HADOOP_VERSION=<Your Hadoop version number>
#
# Point to the JDK installation (JAVA_INSTALL must be set before the
# CHPL_AUXIO_* variables below, which reference it)
#
export JAVA_INSTALL=<Place where you have the jdk installed>

#
# Note that the following environment variables might contain more paths than
# those listed below if you have more than one IO system enabled. These are all
# that you will need in order to use HDFS (only)
#
export CHPL_AUXIO_INCLUDE="-I$JAVA_INSTALL/include -I$JAVA_INSTALL/include/linux  -I$HADOOP_INSTALL/src/c++/libhdfs"
export CHPL_AUXIO_LIBS="-L$JAVA_INSTALL/jre/lib/amd64/server -L$HADOOP_INSTALL/c++/Linux-amd64-64/lib"

#
# So we can run things such as start-all.sh etc. from anywhere and
# don't need to be in $HADOOP_INSTALL
#
export PATH=$PATH:$HADOOP_INSTALL/bin

#
# Add Hadoop directories to the Java class path
#
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HADOOP_HOME/conf/*:$(hadoop classpath):

#
# So we don't have to "install" these things
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/c++/Linux-amd64-64/lib:$HADOOP_HOME/src/c++/libhdfs:$JAVA_INSTALL/jre/lib/amd64/server:$JAVA_INSTALL:$HADOOP_HOME/lib:$JAVA_INSTALL/jre/lib/amd64:$CLASSPATH

#
# Settings to run Chapel on multiple nodes
#
export GASNET_SPAWNFN=S
export SSH_SERVERS=<the names of the computers in your cluster>
export SSH_CMD=ssh
export SSH_OPTIONS=-x
export GASNET_ROUTE_OUTPUT=0

HDFS Support Types and Functions

record hdfsChapelFile

Holds a file per locale

record hdfsChapelFileSystem

Holds a connection to HDFS per locale

proc hdfsChapelConnect(out error: syserr, path: c_string, port: int): c_void_ptr

Make a connection to HDFS for a single locale

proc hdfsChapelConnect(path: string, port: int): hdfsChapelFileSystem

Connect to HDFS and create a filesystem ptr per locale

proc hdfsChapelFileSystem.hdfsChapelDisconnect()

Disconnect from the configured HDFS filesystem on each locale

proc hdfsChapelFileSystem.hdfsOpen(path: string, mode: iomode, hints: iohints = IOHINT_NONE, style: iostyle = defaultIOStyle()): hdfsChapelFile

Open a file on each locale

proc hdfsChapelFile.hdfsClose()

Close a file opened with hdfsChapelFileSystem.hdfsOpen

proc hdfsChapelFileSystem_local.hdfs_chapel_open(path: string, mode: iomode, hints: iohints = IOHINT_NONE, style: iostyle = defaultIOStyle()): file

Open a file on an HDFS filesystem for a single locale

proc hdfs_chapel_connect(path: string, port: int): hdfsChapelFileSystem_local

Connect to an HDFS filesystem on a single locale