/ARCInputFormat

Packages the ARCInputFormat used in Common Crawl in a small jar file that can be used in MapReduce jobs. Implements HdfsARCSource. See README for details.

Primary language: Java. License: Apache License 2.0.

This project extracts only the ARCInputFormat class and its dependencies from the original commoncrawl project. It also implements a new ARCSource, HDFSSource, which allows ARC files to be read from HDFS.

Differences from the original project:

How to compile

To compile the library, first edit the build.properties file and set the hadoop.path variable to match your Hadoop installation. Then simply invoke:

ant

You'll find ARCInputFormat.jar ready for use.
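
The configuration step above can be sketched as follows. This assumes that the build properties file uses the standard Java properties syntax read by Ant and that hadoop.path should point at the root of a local Hadoop installation. The path shown is a hypothetical placeholder, so substitute your own location.

```properties
# Hypothetical example: replace the value with the location of your Hadoop installation.
hadoop.path=/usr/local/hadoop
```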