A plugin to enable Apache Spark to read HDF5 files

Spark-HDF5

Progress

The plugin can read single-dimensional arrays from HDF5 files.

The following types are supported:

  • Int8
  • UInt8
  • Int16
  • UInt16
  • Int32
  • Int64
  • Float32
  • Float64
  • Fixed length strings

Setup

If you use the sbt-spark-package plugin, the easiest way to get this package is to require it from the Spark Packages website:

spDependencies += "LLNL/spark-hdf5:0.0.4"

Otherwise, download the latest release jar and include it on your classpath.

Usage

import gov.llnl.spark.hdf._

val df = sqlContext.read.hdf5("path/to/file.h5", "/dataset")
df.show

You can start a Spark REPL with the console target:

sbt console

This fetches all of the dependencies, sets up a local Spark instance, and starts a Spark REPL with the plugin loaded.

Options

The following options can be set:

Key        | Default | Description
extension  | h5      | The file extension of data
chunk size | 10000   | The maximum number of elements to be read in a single scan
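
As a minimal sketch, the options above could be set as follows. It assumes they are passed through Spark's standard DataFrameReader.option mechanism before calling hdf5, which this README does not confirm. The helper name readWithOptions and the values "hdf5" and "5000" are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import gov.llnl.spark.hdf._

// Hypothetical helper: reads a dataset with non-default option values.
// Assumption: the plugin picks up the "extension" and "chunk size" keys
// from the options set on the DataFrameReader.
def readWithOptions(sqlContext: SQLContext): DataFrame =
  sqlContext.read
    .option("extension", "hdf5")  // illustrative value; the default is "h5"
    .option("chunk size", "5000") // illustrative value; the default is 10000
    .hdf5("path/to/file.h5", "/dataset")
```

The path and dataset name are the same placeholders used in the Usage section. If the options are not picked up this way, the test suite is the best place to check how the plugin expects them to be set.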

Testing

The plugin includes a test suite, which can be run through sbt:

sbt test

Roadmap

Release

This code was developed at Lawrence Livermore National Laboratory (LLNL) and is available under the Apache 2.0 license (LLNL-CODE-699384).