Parquet MR

Parquet-mr is the Java implementation of the Parquet format for use in Hadoop. It uses the record shredding and assembly algorithm described in the Dremel paper. Integrations with Pig and Map/Reduce are provided.

Apache Pig integration

A Loader and a Storer are provided to read and write Parquet files with Apache Pig.

Map/Reduce integration

Thrift

A mapping from Thrift to the Parquet schema is provided for any class that extends TBase. You can read and write Parquet files using Thrift-generated classes.

Create your own objects

The ParquetOutputFormat can be provided a WriteSupport to write your own objects to an event-based RecordConsumer.
The ParquetInputFormat can be provided a ReadSupport to materialize your own POJOs by implementing a RecordMaterializer.
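
The event-based write path can be pictured with a minimal sketch. It is not the parquet-mr API: the `EventConsumer` interface below is a hypothetical, simplified stand-in for a RecordConsumer, and `Point`, `PointWriter`, and `LoggingConsumer` are illustrative names. The real WriteSupport and RecordConsumer have more methods and may differ in signatures.

```java
import java.util.ArrayList;
import java.util.List;

public class EventWriteSketch {

    // Hypothetical, simplified stand-in for an event-based record consumer.
    // This is NOT the parquet-mr RecordConsumer interface.
    interface EventConsumer {
        void startMessage();
        void startField(String name, int index);
        void addInteger(int value);
        void endField(String name, int index);
        void endMessage();
    }

    // Illustrative user object to be written.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Plays the role of a write support: turns one object into a stream of events.
    static final class PointWriter {
        void write(Point p, EventConsumer consumer) {
            consumer.startMessage();
            consumer.startField("x", 0);
            consumer.addInteger(p.x);
            consumer.endField("x", 0);
            consumer.startField("y", 1);
            consumer.addInteger(p.y);
            consumer.endField("y", 1);
            consumer.endMessage();
        }
    }

    // Records each event as a string so the sequence can be inspected.
    static final class LoggingConsumer implements EventConsumer {
        final List<String> events = new ArrayList<>();
        public void startMessage() { events.add("startMessage"); }
        public void startField(String name, int index) { events.add("startField(" + name + ")"); }
        public void addInteger(int value) { events.add("addInteger(" + value + ")"); }
        public void endField(String name, int index) { events.add("endField(" + name + ")"); }
        public void endMessage() { events.add("endMessage"); }
    }

    // Writes one Point and returns the events it produced.
    static List<String> eventsFor(int x, int y) {
        LoggingConsumer consumer = new LoggingConsumer();
        new PointWriter().write(new Point(x, y), consumer);
        return consumer.events;
    }

    public static void main(String[] args) {
        for (String event : eventsFor(3, 4)) {
            System.out.println(event);
        }
    }
}
```

The point of the design is that the writer never builds an intermediate record: it only announces where each message and field starts and ends and pushes primitive values, leaving the consumer free to lay them out however it needs.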

See the APIs:

Build

To run the unit tests: mvn test

To build the jars: mvn package

The build runs in Travis CI.

Add Parquet as a dependency in Maven

Snapshot releases

  <repositories>
    <repository>
      <id>sonatype-nexus-snapshots</id>
      <url>https://oss.sonatype.org/content/repositories/snapshots</url>
      <releases>
        <enabled>false</enabled>
      </releases>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-column</artifactId>
      <version>1.0.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>1.0.0-SNAPSHOT</version>
    </dependency>
  </dependencies>

Official releases

We haven't published a 1.0.0 release yet.

Authors and contributors

Julien Le Dem @J_ https://github.com/julienledem
Tom White https://github.com/tomwhite
Avi Bryant https://github.com/avibryant
Dmitriy Ryaboy @squarecog https://github.com/dvryaboy
Jonathan Coveney http://twitter.com/jco

Discussions

Google group: https://groups.google.com/d/forum/parquet-dev
Group email address: parquet-dev@googlegroups.com

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
