Hadoop Example - A naive PageRank implementation for the Twitter dataset
I created this project as a Hadoop MapReduce example that is more complicated than word counting, yet simple enough to be easily understood by a beginner. It is not intended as a production-ready PageRank calculator.
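For reference, this is the standard update rule that each PageRank iteration computes, with damping factor d and N nodes in the graph:

```
PR(u) = (1 - d) / N  +  d * sum over v->u of  PR(v) / outdegree(v)
```

Each MapReduce iteration in this project computes one application of this rule over the whole graph.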
Some features:
- Map Reduce framework - core functionality
- Multiple consecutive jobs - each depending on the previous results
- FileSystem API - for cleaning up intermediate result files
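To illustrate what each of the chained jobs computes, here is a minimal in-memory sketch of one damped PageRank iteration in plain Java. This is only an illustration: the class and method names below are hypothetical and not part of this project, and the real implementation splits this update across a mapper (emitting each node's rank share to its neighbours) and a reducer (summing the shares).

```java
import java.util.*;

public class PageRankSketch {
    // One damped PageRank iteration over an adjacency list.
    // In the MapReduce version, the mapper emits rank/outdegree to each
    // neighbour and the reducer sums the incoming shares; between
    // iterations the intermediate output directory is deleted via the
    // FileSystem API before the next job writes to it.
    static Map<Integer, Double> iterate(Map<Integer, List<Integer>> graph,
                                        Map<Integer, Double> ranks,
                                        double d) {
        int n = ranks.size();
        Map<Integer, Double> next = new HashMap<>();
        // Every node starts with the teleport term (1 - d) / N.
        for (Integer node : ranks.keySet()) {
            next.put(node, (1 - d) / n);
        }
        // Each node distributes its damped rank evenly to its neighbours.
        for (Map.Entry<Integer, List<Integer>> e : graph.entrySet()) {
            double share = ranks.get(e.getKey()) / e.getValue().size();
            for (Integer target : e.getValue()) {
                next.merge(target, d * share, Double::sum);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Tiny follower graph: 1 -> 2, 2 -> 3, 3 -> 1.
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(1, List.of(2));
        graph.put(2, List.of(3));
        graph.put(3, List.of(1));
        Map<Integer, Double> ranks = new HashMap<>();
        for (Integer node : graph.keySet()) {
            ranks.put(node, 1.0 / 3);
        }
        // Run a fixed number of iterations, as the chained jobs do.
        for (int i = 0; i < 10; i++) {
            ranks = iterate(graph, ranks, 0.85);
        }
        System.out.printf("sum=%.4f rank(1)=%.4f%n",
                ranks.values().stream().mapToDouble(Double::doubleValue).sum(),
                ranks.get(1));
    }
}
```

On this symmetric 3-cycle the ranks stay at 1/3 each and sum to 1, which is a handy sanity check for any PageRank implementation.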
This project is a work in progress, so stay tuned - more features are coming!
Setup and usage:
- Install Hadoop; see this page for Ubuntu, or the equivalent for your environment.
- (Optional) Install the Hadoop Eclipse plugin, see this page.
- Download the Twitter dataset from here. The archive is several GB in size, so the download may take a while depending on your connection.
- Create a smaller subset. If you are trying this out on a single machine you probably don't have the time to wait for the whole 26GB to be processed.
You can create a file containing only the first N lines with the head command-line utility, e.g. head -n 10000000 twitter_rv.net > small.txt
- Change the file paths to correspond to your environment.
- Run the example, either from Eclipse or by generating the jar and submitting it via command line.
- On my laptop it takes about 20 minutes per 10 million lines (roughly 150 MB), so wait a little and your results should be there.
Copyright (C) 2013 Goran Jovic
Distributed under the Eclipse Public License.