This repository is about programs of hadoop, expecially MapReduce programs. Whole programs have been coded with Python.
Use python(<=2.7.10) and MRJob library
- MRJob library install (perhaps, you need administrator privilege)
> pip install mrjob
We will use that library all the time. So, if you want to get more information, please go to following site.
If you are using Windows, it is recommended to use Enthought Canopy in this lecture. (it looks like IDE for python; provide editor and consoles internally)
Basically, all data have been placed respectively in '/data' directory before execution.
In this project, we will use data got from GroupLens. However, GroupLens data are too large to uploaded to GitHub because of data size. So, Please download from GroupLens or following below direction. Other data could be found /data dir.
> wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
** Large Data **
> wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
- Fake Friends data
data/fakefriends.csv
- Weather data from the year 1800
data/1800.csv
- Book data
data/book.txt
- Order Amount by Customer data
data/customer-orders.csv
- Marvel input data
data/Marvel-Graph.txt data/Marvel-Names.txt
This lecture is dealing with how to make MapReduce(Mapper and Reducer) and implement real requirement like this project; Finding Movie Similarities, how much each movie has similarity with other movies.
Before running these programs, it is recommended to test and look up some small programs described in Test section.
- Finding Movie Similarities
On Local
> python app/MovieSimilarities.py --items=data/ml-100k/u.item data/ml-100k/u.data > dist/moviesimilarities.txt
This program should need 15 mins depanding on your computer spec. Additionally, you can know more commands; with EMR
On EMR of Amazon Web Service
If you want to run this program on EMR, command like below. it will move automatically. First of all, it is important that you have to set up system environment with Amazon Access and Secret keys on your OS before execution.
> python app/MovieSimilarities.py -r emr --items=data/ml-100k/u.item data/ml-100k/u.data > dist/sims.txt
However, it will take more time to be finished then previous one; running on your own computer locally is faster because the spec of your machine seems better. So, we need more options like below. (with 4 instances)
> python app/MovieSimilarities.py -r emr --num-ec2-instances 4 --items=data/ml-100k/u.item data/ml-100k/u.data > dist/sims-4-machines.txt
NOTE
In fact, there are not 4 machines in working; The only 3 machines(Slave) are working, but one of them is Master controlling others. And you should know that 3 machines made final result, so the each result of 3 reducers is not sorted when they are merged. So, you can see the result having 3 sections.
How to Troubleshooting
> python app/MovieSimilarities.py -r emr --num-ec2-instances 4 --items=data/ml-100k/u.ITEM data/ml-100k/u.data > simsfail.txt
Looking above things, we're able to find that there is an error; u.ITEM is not same with u.item because uppercase letters are handled differently with lowercase ones except on Windows(Windows looks these two cases same).
At that time, we can get log from following command. [j-1NXMMBNEQHAFT] can be changed (it is EMR Cluster ID)
> python -m mrjob.tools.emr.fetch_logs --find-failure [j-1NXMMBNEQHAFT]
Scale Up for Large data
We can execute this program for getting results within better time by increasing machines.
python app/MovieSimilarities.py -r emr --num-ec2-instances 20 --items=data/ml-100k/u.item data/ml-100k/u.data > dist/sims-20-machines.txt
- Rating Counter with ml-100k data
> python app/RatingCounter.py data/ml-100k/u.data
- Average Friends by Age with Fake Friends data
> python app/FriendsByAge.py data/fakefriends.csv
> python app/FriendsByAge.py data/fakefriends.csv > dist/friendsbyage.txt
- Finding Temperature Extremes
Get minimum temperature by location
> python app/MinTemperatures.py data/1800.csv > dist/mintemps.txt
Get maximum temperature by location
> python app/MaxTemperatures.py data/1800.csv > dist/maxtemps.txt
- Word Frequency
However, it still make some errors
> python app/WordFrequency.py data/Book.txt > dist/wordcount.txt
Better version, but the order of results is weird
> python app/WordFrequencyBetter.py data/Book.txt > dist/wordcountbetter.txt
Using chaining step method, results are sorted by the values(frequencies)
> python app/WordFrequencySorted.py data/Book.txt > dist/wordcountsorted.txt
- Count Order Amount by Customer (homework)
> python app/CustomerOrderAmount.py data/customer-orders.csv > dist/customerorders.txt
Implemented with sort
> python app/CustomerOrderAmountSorted.py data/customer-orders.csv > dist/customerorderssorted.txt
Implemented with combiner
> python app/CustomerOrderAmountCombiner.py data/customer-orders.csv > dist/customerorderscombiner.txt
- Most Popular Movie
> python app/MostPopularMovie.py data/ml-100k/u.data > dist/mostpopularmovie.txt
Mapping Movie's name with the results
> python app/MostPopularMovieNicer.py --items=data/ml-100k/u.item data/ml-100k/u.data > dist/mostpopularmovienicer.txt
- Most Popular SuperHero
> python app/MostPopularSuperHero.py --names=data/Marvel-Names.txt data/Marvel-Graph.txt > dist/mostpopularsuperhero.txt
- Degrees of SuperHero
First, we need to make BFS of selected SuperHeroID; ex. 2548. Before execution, 'data/BFS-iteration-0.txt' file has to be created for results.
> python app/ProcessMarvel.py 2548
Second, we run this program again and again util getting result.
> python app/BFSIteration.py --target=100 data/BFS-iteration-0.txt > data/BFS-iteration-1.txt
> python app/BFSIteration.py --target=100 data/BFS-iteration-1.txt > data/BFS-iteration-2.txt
...
- Udemy Lecture : Taming Big Data with MapReduce and Hadoop