HackReduce

http://www.hackreduce.org


Prerequisites
-------------
- Java 1.6+
- Ant
- Git
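
A quick way to confirm the tools above are installed is sketched below. The function name "missing_tools" is hypothetical and not part of HackReduce. It only checks that each command is on the PATH; it does not verify the Java version.

```shell
# Hypothetical helper, not part of HackReduce:
# print the name of each given tool that cannot be found on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Calling "missing_tools java ant git" should print nothing when all three prerequisites are available.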


Run an example job locally
--------------------------

1) git clone git://github.com/hoppertravel/HackReduce.git
   - Note: You should periodically run "git pull" from within the project directory to update your code.
2) cd HackReduce
3) ant
4) Try running an example from the list below


Examples
--------

Run any of the following commands from the project directory. Once the job has completed, check the output folder given as the second argument (under /tmp/ in these examples) for the results.

Bixi:
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.bixi.RecordCounter datasets/bixi /tmp/bixi_recordcounts

NASDAQ:
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.HighestDividend datasets/nasdaq/dividends /tmp/nasdaq_dividends
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.MarketCapitalization datasets/nasdaq/daily_prices /tmp/nasdaq_marketcaps
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.RecordCounter datasets/nasdaq/daily_prices /tmp/nasdaq_recordcounts

NYSE:
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.HighestDividend datasets/nyse/dividends /tmp/nyse_dividends
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.MarketCapitalization datasets/nyse/daily_prices /tmp/nyse_marketcaps
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.RecordCounter datasets/nyse/daily_prices /tmp/nyse_recordcounts

Flights:
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.flights.RecordCounter datasets/flights /tmp/flights_recordcounts

Wikipedia:
> java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.wikipedia.RecordCounter datasets/wikipedia /tmp/wikipedia_recordcounts

Note: Each job is written for a specific dataset, so pair them up as shown above. The second argument (/tmp/*) is just an arbitrary output path for the job's results; change it to anything you want.
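
Since every command above follows the same pattern (job class, input path, output path), a small wrapper can cut down on typing. This is a minimal sketch, not part of HackReduce: the function names "run_example" and "show_output" and the DRY_RUN variable are hypothetical. It assumes that Hadoop refuses to write into an output directory that already exists (its usual behaviour) and that result files are named "part-*" (Hadoop's usual naming), so it deletes the old output before each run.

```shell
# Hypothetical helper, not part of HackReduce: run one example job locally.
# Usage: run_example <job class> <input path> <output path>
# Set DRY_RUN=1 to print the java command instead of executing it.
run_example() {
  if [ "$#" -ne 3 ]; then
    echo "usage: run_example <job class> <input path> <output path>" >&2
    return 1
  fi
  local job_class="$1" input="$2" output="$3"
  local classpath=".:dist/hackreduce-0.1.jar:lib/*"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "java -classpath \"$classpath\" $job_class $input $output"
    return 0
  fi

  # Assumption: the job fails if the output directory already exists,
  # so remove the results of any previous run first.
  rm -rf -- "$output"
  java -classpath "$classpath" "$job_class" "$input" "$output"
}

# Print the first lines of each result file in a job's output directory.
# Assumption: result files follow Hadoop's usual "part-*" naming.
show_output() {
  local output="$1" file
  for file in "$output"/part-*; do
    [ -f "$file" ] || continue
    echo "== $file"
    head -n 5 "$file"
  done
}
```

For example, "run_example org.hackreduce.examples.bixi.RecordCounter datasets/bixi /tmp/bixi_recordcounts" followed by "show_output /tmp/bixi_recordcounts" runs the Bixi job and previews its results. Be careful with the output path, since the wrapper deletes whatever is there.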


Datasets
--------
- Bixi (Courtesy of Fabrice)
- Flights (Courtesy of Hopper)
- NASDAQ daily prices and dividends (http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nasdaq-exchange)
- NYSE daily prices and dividends (http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nyse-exchange)
- Wikipedia XML (http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia)

Take a look at the datasets/ folder to see sample subsets of these datasets.


Running on a Hadoop cluster
---------------------------
If you'd like to set up an actual Hadoop cluster (single-node or multi-node), follow the instructions on the Hadoop wiki: http://wiki.apache.org/hadoop/QuickStart

If you're running the HackReduce virtual image, or running the example jobs against an actual install of Hadoop, the process is similar to the local run. However, if your install uses HDFS, you'll first need to upload the datasets folder into HDFS; otherwise, just make a note of the folder's path.

Example:
> bin/hadoop jar <path_to_hackreduce_jar>/hackreduce-0.1.jar org.hackreduce.examples.stockexchange.RecordCounter <path to>/datasets/nyse/daily_prices /tmp/nyse_recordcounts
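
The upload step mentioned above can be sketched with Hadoop's file system shell ("fs -put" to copy, "fs -ls" to verify). This is a minimal sketch, not part of HackReduce: the function name "upload_datasets" and the HADOOP and DRY_RUN variables are hypothetical, and the HDFS destination is whatever path suits your cluster.

```shell
# Hypothetical helper, not part of HackReduce: copy a local datasets folder into HDFS.
# Usage: upload_datasets <local datasets dir> <HDFS destination dir>
# HADOOP is assumed to point at the hadoop launcher script (default: bin/hadoop).
# Set DRY_RUN=1 to print the commands instead of executing them.
upload_datasets() {
  if [ "$#" -ne 2 ]; then
    echo "usage: upload_datasets <local datasets dir> <HDFS destination dir>" >&2
    return 1
  fi
  local src="$1" dest="$2"
  local hadoop="${HADOOP:-bin/hadoop}"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$hadoop fs -put $src $dest"
    echo "$hadoop fs -ls $dest"
    return 0
  fi

  # Copy the folder into HDFS, then list it to confirm the upload.
  "$hadoop" fs -put "$src" "$dest" && "$hadoop" fs -ls "$dest"
}
```

After the upload, pass the HDFS destination (plus the dataset subfolder, such as nyse/daily_prices) as the input path of the "bin/hadoop jar" command.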