Apache DataFu
Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics.
It consists of two libraries:
- Apache DataFu Pig: a collection of user-defined functions for Apache Pig
- Apache DataFu Hourglass: an incremental processing framework for Apache Hadoop in MapReduce
For more information please visit the website:
If you'd like to jump in and get started, check out the corresponding guides for each library:
Blog Posts
- Introducing DataFu
- DataFu: The WD-40 of Big Data
- DataFu 1.0
- DataFu's Hourglass: Incremental Data Processing in Hadoop
Presentations
- A Brief Tour of DataFu
- Building Data Products at LinkedIn with DataFu
- Hourglass: a Library for Incremental Processing on Hadoop (IEEE BigData 2013)
Papers
Getting Help
Bugs and feature requests can be filed here. For other help please see the website.
Developers
Building the Code
To build DataFu from a git checkout or binary release, run:
./gradlew clean assemble
To build DataFu from a source release, it is first necessary to download the Gradle wrapper script (gradlew) used above. This bootstrapping process requires Gradle to be installed on the build machine. Gradle is available through most package managers or directly from its website. To bootstrap the wrapper, run:
gradle -b bootstrap.gradle
After the bootstrap script has completed, the regular gradlew instructions are available.
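The two build paths above can be combined into one helper. The following is a minimal sketch, not part of DataFu: the function name build_datafu is hypothetical, and it assumes that a missing ./gradlew file in the current directory is a good enough signal that the wrapper still needs to be bootstrapped. It must be run from the root of the DataFu checkout or source release.

```shell
# Hypothetical helper (not shipped with DataFu): build from either a git
# checkout or a source release.
# Assumption: if ./gradlew is absent, the wrapper has not been bootstrapped yet.
build_datafu() {
  if [ ! -f ./gradlew ]; then
    # Source release: requires a locally installed Gradle.
    gradle -b bootstrap.gradle || return 1
  fi
  # Regular wrapper-based build, as documented above.
  ./gradlew clean assemble
}
```

If the bootstrap step fails (for example because Gradle is not installed), the function returns a non-zero status without attempting the build.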
The datafu-pig JAR can be found under datafu-pig/build/libs by the name datafu-pig-x.y.z.jar, where x.y.z is the version. Similarly, the datafu-hourglass JAR can be found in the datafu-hourglass/build/libs directory.
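Because the version is part of the file name, it can be convenient to locate the built JARs with a glob rather than typing the version. The sketch below is illustrative only: list_datafu_jars is a hypothetical helper, and it assumes the directory layout described above. Note that the glob matches any file starting with the module name, so other JARs in the same directory that share that prefix would be listed too.

```shell
# Hypothetical helper (not shipped with DataFu): list the JARs built for a module.
# Usage: list_datafu_jars <module> [checkout-root]
# Assumes the <module>/build/libs/<module>-x.y.z.jar layout described above.
list_datafu_jars() {
  module="$1"
  root="${2:-.}"
  for jar in "$root/$module/build/libs/$module"-*.jar; do
    # Skip the unexpanded pattern when nothing has been built yet.
    [ -e "$jar" ] && printf '%s\n' "$jar"
  done
  return 0
}
```

For example, `list_datafu_jars datafu-pig` run from the checkout root prints the path of each matching JAR, or nothing if the module has not been built.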
Generating Eclipse Files
This command generates the Eclipse project and classpath files:
./gradlew eclipse
To clean up the Eclipse files:
./gradlew cleanEclipse
Running the Tests
To run all the tests:
./gradlew test
To run only the DataFu Pig tests:
./gradlew :datafu-pig:test
To run only the DataFu Hourglass tests:
./gradlew :datafu-hourglass:test
To run tests for a single class, use the test.single property. For example, to run only the QuantileTests:
./gradlew :datafu-pig:test -Dtest.single=QuantileTests
The tests can also be run from within Eclipse. Note that you may run out of heap space when executing tests in Eclipse. To fix this, adjust the heap settings for the TestNG plugin: go to Eclipse->Preferences, select TestNG->Run/Debug, and add "-Xmx1G" to the JVM args.