- Supports the Hierarchical Data Format (HDF5/NetCDF4) and a rich parallel I/O interface in Spark
- Optimizes I/O performance on Cray machines with Lustre filesystems
- For reading multiple files, the input is a CSV file that lists file path and variable name, e.g., src/resources/hdf5/scalafilelist
- For reading a single file, the input is a CSV file that lists file path, variable name, start, and offset, e.g., src/resources/hdf5/
- The output is a single RDD in which each element is one row of the original file
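For illustration only, a file list for the multiple-file case might look like the following; the paths and dataset names are hypothetical, and each line follows the "file path, variable name" layout described above:

```
/path/to/data/part-000.h5,temperature
/path/to/data/part-001.h5,temperature
```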
# Sample Batch Job on Cori
- Python version: sbatch spark-python.sh
- Scala version: sbatch spark-scala.sh
# Use in Your PySpark Scripts
Add this to your Python path:
- export PYTHONPATH=path/to/h5spark/src/main/python/:$PYTHONPATH
Then import it in Python like so:
- from h5spark import read
- from pyspark import SparkContext
- sc = SparkContext()
- rdd = read.readH5(sc, ('path/to/h5file', 'dataset_name'))
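The returned rdd is an ordinary Spark RDD, so the usual actions apply; for example, rdd.count() returns the number of rows loaded.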
# Download and Compile H5Spark
- git clone https://github.com/valiantljk/h5spark.git
- cd h5spark
- sbt package
- cp target/scala-2.10/h5spark_2.10-1.0.jar lib/
- cp -r lib/ your_project_dir/ (if you already have a lib directory, just copy everything from h5spark/lib/ into it)
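If your project also builds with sbt, jars placed in lib/ (including h5spark_2.10-1.0.jar) are picked up automatically as unmanaged dependencies. A minimal build.sbt sketch, where the project name and versions are placeholders:

```scala
// build.sbt — minimal sketch; name and version are assumptions
name := "your-project"
version := "0.1"
scalaVersion := "2.10.5" // matches the h5spark_2.10 artifact built above
// jars in lib/ are added to the classpath automatically by sbt
```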
# Use H5Spark in Your Scala Code
- export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:your_project_dir/lib
- add this line to your code: import org.nersc.io._
- then you have a few options to load the data (a complete sketch follows the list below)
- the inputpath can be the absolute path of a single large HDF5 file, or a path to multiple small HDF5 files, e.g., a directory that contains millions of files
** Load as an indexed matrix: val tempmat = read.h5read_imat(sc, inputpath, variablename, partition)
** Load as an indexed row: val temprow = read.h5read_irow(sc, inputpath, variablename, partition)
** Load as an array: val temprdd = read.h5read(sc, inputpath, variablename, partition)
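A minimal end-to-end sketch, assuming the signatures listed above; the file path, dataset name, and partition count are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.nersc.io._

object H5SparkExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("h5spark-example")
    val sc = new SparkContext(conf)

    // Load one HDF5 dataset as a flat RDD; the last argument is the
    // number of Spark partitions to split the read across.
    val rdd = read.h5read(sc, "path/to/h5file", "dataset_name", 2000)
    println("rows read: " + rdd.count())

    sc.stop()
  }
}
```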