- Supports the Hierarchical Data Format (HDF5/NetCDF4) and a rich parallel I/O interface in Spark
- Optimizes I/O performance on Cray machines with Lustre filesystems
- For reading multiple files, the input is a CSV file that lists file path and variable name, e.g., src/resources/hdf5/scalafilelist
- For reading a single file, the input is a CSV file that lists file path, variable name, start, and offset, e.g., src/resources/hdf5/
- The output is a single RDD in which each element is one row of the original file
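For illustration only, a file list for the multiple-file case might look like the following; the paths and dataset names are hypothetical, and each line follows the "file path, variable name" layout described above:

```
/path/to/data/part-000.h5,temperature
/path/to/data/part-001.h5,temperature
```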
# Sample Batch Job on Cori
- Python version: sbatch spark-python.sh
- Scala version: sbatch spark-scala.sh
# Use in Your PySpark Scripts
Add this to your Python path:
- export PYTHONPATH=path/to/h5spark/src/main/python/:$PYTHONPATH
Then import it in Python like so:
- from h5spark import read
- from pyspark import SparkContext
- sc = SparkContext()
- rdd = read.readH5(sc, ('path/to/h5file', 'dataset_name'))
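The returned rdd is an ordinary Spark RDD, so the usual actions apply; for example, rdd.count() returns the number of rows loaded.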
# Download and Compile H5Spark
- git clone https://github.com/valiantljk/h5spark.git
- cd h5spark
- sbt package
- cp target/scala-2.10/h5spark_2.10-1.0.jar lib/
- cp -r lib/ your_project_dir/ (if you already have a lib directory, just copy everything from h5spark/lib/ into it)
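If your project also builds with sbt, jars placed in lib/ (including h5spark_2.10-1.0.jar) are picked up automatically as unmanaged dependencies. A minimal build.sbt sketch, where the project name and versions are placeholders:

```scala
// build.sbt — minimal sketch; name and version are assumptions
name := "your-project"
version := "0.1"
scalaVersion := "2.10.5" // matches the h5spark_2.10 artifact built above
// jars in lib/ are added to the classpath automatically by sbt
```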
# Use H5Spark in Your Scala Code
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:your_project_dir/lib
- add this line to your code: import org.nersc.io._
- then you have a few options to load the data (a complete sketch follows the list below)
- the inputpath can be the absolute path of a single large HDF5 file, or a path to multiple small HDF5 files, e.g., a directory that contains millions of files
** Load as an indexed matrix: val tempmat = read.h5read_imat(sc, inputpath, variablename, partition)
** Load as an indexed row: val temprow = read.h5read_irow(sc, inputpath, variablename, partition)
** Load as an array: val temprdd = read.h5read(sc, inputpath, variablename, partition)
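A minimal end-to-end sketch, assuming the signatures listed above; the file path, dataset name, and partition count are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.nersc.io._

object H5SparkExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("h5spark-example")
    val sc = new SparkContext(conf)

    // Load one HDF5 dataset as a flat RDD; the last argument is the
    // number of Spark partitions to split the read across.
    val rdd = read.h5read(sc, "path/to/h5file", "dataset_name", 2000)
    println("rows read: " + rdd.count())

    sc.stop()
  }
}
```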