HDFS

Usage

use HDFS;

or

import HDFS;

Support for the Hadoop Distributed File System.

This module implements support for the Hadoop Distributed File System (HDFS).

Note

HDFS support in Chapel currently requires CHPL_TASKS=fifo because of a compatibility problem with qthreads.

Using HDFS Support in Chapel

To open an HDFS file in Chapel, first create an HDFSFileSystem by connecting to an HDFS name node.

import HDFS;

var fs = HDFS.connect(); // can pass a nameNode host and port here,
                         // otherwise uses HDFS default settings.

The filesystem connection will be closed when fs and any files it refers to go out of scope.

Once you have an hdfs record, you can open files within that filesystem using HDFSFileSystem.open and perform I/O on them using the usual functionality in the IO module:

use IO; // provides ioMode

var f = fs.open("/tmp/testfile.txt", ioMode.cw);
var writer = f.writer();
writer.writeln("This is a test");
writer.close();
f.close();
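Reading follows the same pattern. The sketch below assumes the file written above exists and that the default HDFS settings point at a reachable name node; it opens the file with ioMode.r and echoes it line by line using readLine from the IO module.

```chapel
import HDFS;
use IO;

// Connect using the HDFS default name node and port.
var fs = HDFS.connect();

// Open the file written above for reading.
var f = fs.open("/tmp/testfile.txt", ioMode.r);
var reader = f.reader();

// readLine returns false at end of file; by default the trailing
// newline is kept in line, so write (not writeln) avoids doubling it.
var line: string;
while reader.readLine(line) do
  write(line);

reader.close();
f.close();
```

Recent Chapel releases may warn unless an explicit locking argument is passed to reader; the call above mirrors the writer() call used earlier in this section.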

Note

Due to limitations in HDFS itself, ioMode.r and ioMode.cw are the only modes supported for HDFS files. ioMode.cwr and ioMode.rw are not supported.
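One way to guard against an unsupported mode is to check it before calling open. The helper below is hypothetical (isSupportedHdfsMode is not part of the HDFS module) and simply encodes the rule stated in the note.

```chapel
use IO;

// Hypothetical helper, not part of the HDFS module: returns true only
// for the two modes the note above lists as supported on HDFS files.
proc isSupportedHdfsMode(mode: ioMode): bool {
  return mode == ioMode.r || mode == ioMode.cw;
}

writeln(isSupportedHdfsMode(ioMode.cw));  // prints "true"
writeln(isSupportedHdfsMode(ioMode.rw));  // prints "false"
```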

Dependencies

Please refer to the Hadoop and HDFS documentation for instructions on setting up HDFS.

Once you have a working HDFS installation, it’s a good idea to test it with a C program before proceeding with Chapel HDFS support. Try compiling the C program below:

// hdfs-test.c

#include <hdfs.h>

#include <string.h>
#include <stdio.h>
#include <stdlib.h>

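int main(int argc, char **argv) {

    /* "default" selects the file system named in the Hadoop configuration */
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) {
        fprintf(stderr, "Failed to connect to HDFS!\n");
        exit(-1);
    }
    const char* writePath = "/tmp/testfile.txt";
    hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
    if (!writeFile) {
        fprintf(stderr, "Failed to open %s for writing!\n", writePath);
        exit(-1);
    }
    const char* buffer = "Hello, World!";
    /* hdfsWrite returns the number of bytes written, or -1 on error */
    tSize num_written_bytes = hdfsWrite(fs, writeFile, (const void*)buffer, strlen(buffer)+1);
    if (num_written_bytes < 0) {
        fprintf(stderr, "Failed to write to %s\n", writePath);
        exit(-1);
    }
    if (hdfsFlush(fs, writeFile)) {
        fprintf(stderr, "Failed to 'flush' %s\n", writePath);
        exit(-1);
    }
    hdfsCloseFile(fs, writeFile);
    hdfsDisconnect(fs);
    return 0;
}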

This program will probably not compile without some special environment variables set. The following commands worked for us to compile and run this program, but you will almost certainly need different settings depending on your HDFS installation.

export JAVA_HOME=/usr/lib/jvm/default-java/lib
export HADOOP_HOME=/usr/local/hadoop/
gcc hdfs-test.c -I$HADOOP_HOME/include -L$HADOOP_HOME/lib/native -lhdfs
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native:$JAVA_HOME/lib
./a.out

# verify that the new test file was created
$HADOOP_HOME/bin/hdfs dfs -ls /tmp

HDFS Support Types and Functions

proc connect(nameNode: string = "default", port: int = 0) throws

Connect to an HDFS filesystem. If nameNode or port are not provided, the HDFS defaults will be used.

Arguments:
  • nameNode – the hostname for an HDFS name node to connect to

  • port – the port on which the HDFS service is running on the name node

Returns:

An hdfs record representing the connected filesystem.
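Because connect throws, a caller may want to handle a failed connection explicitly. The following is a minimal sketch: the host name and port are placeholder values chosen for illustration, not defaults of the module.

```chapel
import HDFS;

proc main() {
  try {
    // "namenode.example.com" and 8020 are placeholders; substitute the
    // host and port of your own HDFS name node.
    var fs = HDFS.connect(nameNode="namenode.example.com", port=8020);
    writeln("connected");
    // fs goes out of scope at the end of this block.
  } catch e {
    // Any error thrown by connect lands here.
    writeln("HDFS connection failed: ", e.message());
  }
}
```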

record hdfs

Record storing an open HDFS filesystem. Please see HDFSFileSystem for the forwarded methods available, in particular HDFSFileSystem.open.

class HDFSFileSystem

Class representing a connected HDFS file system. This connection is reference counted and shared by open files.

proc open(path: string, mode: ioMode, in flags: c_int = 0, bufferSize: c_int = 0, replication: c_short = 0, blockSize: tSize = 0) throws

Open an HDFS file stored at a particular path. Note that once the file is open, you will need to use IO.file.reader or IO.file.writer to create a channel to actually perform I/O operations.

Arguments:
  • path – which file to open (for example, “some/file.txt”).

  • mode – specify whether to open the file for reading or writing and whether or not to create the file if it doesn’t exist. See IO.ioMode.

  • flags – flags to pass to the HDFS open call. Uses flags appropriate for mode if not provided.

  • bufferSize – buffer size to pass to the HDFS open call. Uses the HDFS default value if not provided.

  • replication – replication factor to pass to the HDFS open call. Uses the HDFS default value if not provided.

  • blockSize – block size to pass to the HDFS open call. Uses the HDFS default value if not provided.
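The optional arguments can be passed by name. The sketch below assumes a reachable HDFS with default connection settings; the path, buffer size, and replication factor are illustrative values only. Arguments left out (flags and blockSize here) keep their defaults as described above.

```chapel
import HDFS;
use IO, CTypes;

var fs = HDFS.connect();

// Illustrative values: a 64 KiB buffer and a replication factor of 2.
// The casts match the c_int and c_short formal types in the signature.
var f = fs.open("/tmp/replicated.txt", ioMode.cw,
                bufferSize=65536: c_int,
                replication=2: c_short);

var w = f.writer();
w.writeln("written with explicit open options");
w.close();
f.close();
```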