Spark-HDF5

Progress

The plugin can read single-dimensional arrays from HDF5 files.

The following types are supported:

If you are using the sbt-spark-package plugin, the easiest way to use this package is to require it from the Spark Packages website:

spDependencies += "LLNL/spark-hdf5:0.0.4"

Otherwise, download the latest release jar and include it on your classpath.

import gov.llnl.spark.hdf._

val df = sqlContext.read.hdf5("path/to/file.h5", "/dataset")
df.show

You can start a Spark REPL with the console target:

sbt console

This will fetch all of the dependencies, set up a local Spark instance, and start a Spark REPL with the plugin loaded.

The following options can be set:

Key	Default	Description
`extension`	`h5`	The file extension of the data files
`chunk size`	`10000`	The maximum number of elements to be read in a single scan
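
To make the `chunk size` option concrete, the sketch below shows how a one-dimensional dataset of a given length could be split into scans of at most `chunk size` elements. It only illustrates the idea and is not the plugin's actual code; `chunkRanges` and its parameters are hypothetical names introduced here.

```scala
// Hypothetical helper, not part of spark-hdf5: splits `total` elements
// into half-open [start, end) ranges of at most `chunkSize` elements each.
def chunkRanges(total: Long, chunkSize: Int): Seq[(Long, Long)] = {
  require(total >= 0, "total must be non-negative")
  require(chunkSize > 0, "chunkSize must be positive")
  (0L until total by chunkSize.toLong).map { start =>
    (start, math.min(start + chunkSize, total))
  }
}

// With the default chunk size of 10000, a dataset of 25000 elements
// would need three scans; the last one is shorter than the others.
val ranges = chunkRanges(25000L, 10000)
ranges.foreach { case (start, end) => println(s"scan elements [$start, $end)") }
```

The options themselves are presumably passed through Spark's standard `DataFrameReader.option` call before `hdf5`, for example `sqlContext.read.option("chunk size", "5000")`; this is an assumption and should be checked against the plugin source.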

The plugin includes a test suite, which can be run through sbt:

sbt test

This code was developed at Lawrence Livermore National Laboratory (LLNL) and is available under the Apache 2.0 license (LLNL-CODE-699384).