Parquet-mr is the Java implementation of the Parquet format for use in Hadoop. It uses the record shredding and assembly algorithm described in the Dremel paper. Integration with Pig and Map/Reduce is provided.
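As a rough illustration of record shredding and assembly, the sketch below stripes a single repeated string field into one column of (value, repetition level, definition level) entries and then rebuilds the records from that column. It assumes a hypothetical schema with one top-level repeated field, so both levels are at most 1. The class and method names (`ShreddingSketch`, `Triple`, `shred`, `assemble`) are made up for this example and are not part of parquet-mr, which handles arbitrarily nested schemas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: none of these types belong to parquet-mr.
class ShreddingSketch {

    /** One column entry: a value plus its repetition and definition level. */
    static final class Triple {
        final String value;   // null when the record has no value for the field
        final int repetition; // 0 = first entry of a record, 1 = continues the current list
        final int definition; // 0 = field absent in this record, 1 = field present

        Triple(String value, int repetition, int definition) {
            this.value = value;
            this.repetition = repetition;
            this.definition = definition;
        }

        @Override
        public String toString() {
            return value + "(r=" + repetition + ",d=" + definition + ")";
        }
    }

    /** Shreds the repeated field of every record into a single column. */
    static List<Triple> shred(List<List<String>> records) {
        List<Triple> column = new ArrayList<>();
        for (List<String> tags : records) {
            if (tags.isEmpty()) {
                // An empty list still needs an entry so the record boundary survives.
                column.add(new Triple(null, 0, 0));
                continue;
            }
            for (int i = 0; i < tags.size(); i++) {
                column.add(new Triple(tags.get(i), i == 0 ? 0 : 1, 1));
            }
        }
        return column;
    }

    /** Reassembles records from a well-formed column produced by shred(). */
    static List<List<String>> assemble(List<Triple> column) {
        List<List<String>> records = new ArrayList<>();
        for (Triple t : column) {
            if (t.repetition == 0) {
                records.add(new ArrayList<String>()); // repetition 0 starts a new record
            }
            if (t.definition == 1) {
                records.get(records.size() - 1).add(t.value);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<List<String>> records = Arrays.asList(
            Arrays.asList("a", "b"),
            new ArrayList<String>(),
            Arrays.asList("c"));
        List<Triple> column = shred(records);
        System.out.println(column);
        System.out.println(assemble(column).equals(records));
    }
}
```

Running `main` prints `[a(r=0,d=1), b(r=1,d=1), null(r=0,d=0), c(r=0,d=1)]` followed by `true`. The levels are what let the empty record in the middle survive the round trip even though it contributes no value to the column.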
A Loader and a Storer are provided to read and write Parquet files with Apache Pig.
Thrift mapping to the Parquet schema is provided through classes that extend TBase. You can read and write Parquet files using Thrift-generated classes.
- The ParquetOutputFormat can be given a WriteSupport to write your own objects to an event-based RecordConsumer.
- The ParquetInputFormat can be given a ReadSupport to materialize your own POJOs by implementing a RecordMaterializer.
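The write side described above can be pictured with the following minimal sketch. To stay self-contained it does not use the parquet-mr classes: `Consumer` and `Writer` are reduced, hypothetical stand-ins for an event-based record consumer and a write support, and `User`, `UserWriter` and `LoggingConsumer` are invented for the example. The real interfaces have more methods and their signatures may differ, so check the API documentation before relying on any name here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: simplified stand-ins, not the parquet-mr API.
class EventWriteSketch {

    /** Hypothetical, reduced event-based consumer. */
    interface Consumer {
        void startMessage();
        void startField(String name, int index);
        void addLong(long value);
        void addString(String value);
        void endField(String name, int index);
        void endMessage();
    }

    /** Hypothetical, reduced write support: turns one object into events. */
    interface Writer<T> {
        void prepareForWrite(Consumer consumer);
        void write(T record);
    }

    /** The user-defined object we want to write. */
    static final class User {
        final long id;
        final String name; // optional, may be null

        User(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    static final class UserWriter implements Writer<User> {
        private Consumer consumer;

        public void prepareForWrite(Consumer consumer) {
            this.consumer = consumer;
        }

        public void write(User user) {
            consumer.startMessage();
            consumer.startField("id", 0);
            consumer.addLong(user.id);
            consumer.endField("id", 0);
            if (user.name != null) { // an absent optional field emits no events
                consumer.startField("name", 1);
                consumer.addString(user.name);
                consumer.endField("name", 1);
            }
            consumer.endMessage();
        }
    }

    /** Test consumer that records each event as a string. */
    static final class LoggingConsumer implements Consumer {
        final List<String> events = new ArrayList<>();

        public void startMessage() { events.add("startMessage"); }
        public void startField(String name, int index) { events.add("startField " + name); }
        public void addLong(long value) { events.add("addLong " + value); }
        public void addString(String value) { events.add("addString " + value); }
        public void endField(String name, int index) { events.add("endField " + name); }
        public void endMessage() { events.add("endMessage"); }
    }

    public static void main(String[] args) {
        LoggingConsumer consumer = new LoggingConsumer();
        UserWriter writer = new UserWriter();
        writer.prepareForWrite(consumer);
        writer.write(new User(7L, "ann"));
        System.out.println(consumer.events);
    }
}
```

The point of the event-based design is that the writer never builds an intermediate record object: it walks its own data structure and emits start/add/end events, and the consumer decides how to store them. Reading works in the opposite direction, with a materializer receiving events and building the POJO.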
See the APIs:
To run the unit tests: mvn test
To build the jars: mvn package
The build runs in Travis CI.
To use Parquet as a Maven dependency, add the following to your pom.xml:
<repositories>
  <repository>
    <id>sonatype-nexus-snapshots</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-column</artifactId>
    <version>1.0.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>com.twitter</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.0.0-SNAPSHOT</version>
  </dependency>
</dependencies>
We haven't published a 1.0.0 release yet, so only snapshot builds are available.
- Julien Le Dem @J_ https://github.com/julienledem
- Tom White https://github.com/tomwhite
- Avi Bryant https://github.com/avibryant
- Dmitriy Ryaboy @squarecog https://github.com/dvryaboy
- Jonathan Coveney http://twitter.com/jco
- Google group: https://groups.google.com/d/forum/parquet-dev
- Group email address: parquet-dev@googlegroups.com
Copyright 2012 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0