/NetApp-Hadoop-NFS-Connector

This projects provides a NFSv3 connector for Hadoop. Using the connector, Apache Hadoop and Apache Spark can use NFSv3 server as their storage backend.

Primary LanguageJava

NetApp-Hadoop-NFS-Connector

This site is obsolete. Please use NetApp FAS NFSConnector product from now.netapp.com

Overview

The Hadoop NFS Connector allows Apache Hadoop (2.2+) and Apache Spark (1.2+) to use a NFSv3 storage server as a storage endpoint. The NFS Connector supports two modes: (1) secondary filesystem - where Hadoop/Spark runs using HDFS as its primary storage and can use NFS as a second storage endpoint, and (2) primary filesystem - where Hadoop/Spark runs entirely on a NFSv3 storage server.

The code is written in a way such that existing applications do not have to change. All one has to do is to copy the connector jar into the lib/ directory of Hadoop/Spark. Then, modify core-site.xml to provide the necessary details.

NOTE: The code is in beta. We would love for you to try it out and give us feedback.

This is the first release and it does the following:

  • Connects to a NFSv3 storage server supporting AUTH_NONE or AUTH_SYS authentication method.
  • Works with Apache Hadoop (vanilla) 2.2 or newer, Hortonworks HDP 2.2 or newer
  • Supports all operations defined by the Hadoop FileSystem interface.
  • Pipelines the READ/WRITE requests to utilize the underlying network (works fine with 1GbE and 10GbE networks)

We are planning to add these in the near future:

  • Ability to connect to multiple NFS endpoints (multiple IP addresses). This allows for even more bandwidth.
  • Integrate with Hadoop user authentication

How to use

Once the NFS connector is configured, you can easily invoke it from the command-line using the Hadoop shell.

  console> bin/hadoop fs -ls nfs://<nfs-server-hostname>:2049/ (if using as secondary filesystem)
  console> bin/hadoop fs -ls / (if using as default/primary filesystem)

When new jobs are submitted, you can simply provide it as an input or output path or both:

  (assuming NFS is used as a secondary filesystem)
  console> bin/hadoop jar <path-to-examples> jar terasort nfs://<nfs-server-hostname>:2049/tera/in /tera/out
  console> bin/hadoop jar <path-to-examples> jar terasort /tera/in nfs://<nfs-server-hostname>:2049/tera/out
  console> bin/hadoop jar <path-to-examples> jar terasort nfs://<nfs-server-hostname>:2049/tera/in nfs://<nfs-server-hostname>:2049/tera/out

Configuration

  1. Compile the project ``` console> mvn clean package ```
  2. Copy the jar file to the shared common library directory based on your Hadoop installation. For example, for hadoop-2.4.1: ``` console> cp target/hadoop-connector-nfsv3-1.0.jar $HADOOP_HOME/share/hadoop/common/lib/ ```
  3. Add parameters of NFSv3 connector into core-site.xml located in Hadoop configuration directory (e.g., for hadoop-2.4.1: $HADOOP_HOME/conf) ``` fs.defaultFS nfs://:2049 fs.nfs.configuration /nfs-mapping.json fs.nfs.impl org.apache.hadoop.fs.nfs.NFSv3FileSystem fs.AbstractFileSystem.nfs.impl org.apache.hadoop.fs.nfs.NFSv3AbstractFilesystem ```
  4. Start Hadoop. NFS can now be used inside Hadoop.