HackReduce http://www.hackreduce.org Prerequisites ------------- - Java 1.6+ - Ant - Git Run an example job locally -------------------------- 1) git clone git://github.com/hoppertravel/HackReduce.git - Note: You should periodically run "git pull" from within the project directory to update your code. 2) cd HackReduce 3) ant 4) Try running an example from the list below Examples -------- Run any of the following commands in your CLI, and after the job's completed, check the /tmp/* folder for the output. Bixi: > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.bixi.RecordCounter datasets/bixi /tmp/bixi_recordcounts NASDAQ: > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.HighestDividend datasets/nasdaq/dividends /tmp/nasdaq_dividends > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.MarketCapitalization datasets/nasdaq/daily_prices /tmp/nasdaq_marketcaps > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.RecordCounter datasets/nasdaq/daily_prices /tmp/nasdaq_recordcounts NYSE: > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.HighestDividend datasets/nyse/dividends /tmp/nyse_dividends > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.MarketCapitalization datasets/nyse/daily_prices /tmp/nyse_marketcaps > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.stockexchange.RecordCounter datasets/nyse/daily_prices /tmp/nyse_recordcounts Flights: > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.flights.RecordCounter datasets/flights /tmp/flights_recordcounts Wikipedia: > java -classpath ".:dist/hackreduce-0.1.jar:lib/*" org.hackreduce.examples.wikipedia.RecordCounter datasets/wikipedia /tmp/wikipedia_recordcounts Note: The jobs are made for the specific datasets, so pairing them up properly is important. The second argument (/tmp/*) is just a made up output path for the results of the job, and can be modified to anything you want. Datasets -------- - Bixi (Courtesy of Fabrice) - Flights (Courtesy of Hopper) - NASDAQ daily prices and dividends (http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nasdaq-exchange) - NYSE daily prices and dividends (http://www.infochimps.com/datasets/daily-1970-2010-open-close-hi-low-and-volume-nyse-exchange) - Wikipedia XML (http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia) Take a look at the datasets/ folder to see samples subsets of these datasets. Running on a Hadoop cluster --------------------------- If you'd like to set up an actual Hadoop cluster (single or multi node), then follow the instructions on the Hadoop wiki: http://wiki.apache.org/hadoop/QuickStart If you're running the HackReduce virtual image, or you're trying to run the example job against an actual install of Hadoop, the process should be similar. However, you'll need to upload the datasets folder into HDFS (if you're running it), or just make a note of the path. Example: > bin/hadoop jar <path_to_hackreduce_jar>/hackreduce-0.1.jar org.hackreduce.examples.stockexchange.RecordCounter <path to>/datasets/nyse/daily_prices /tmp/nyse_recordcounts