moviedata-statistics: A Java repository from vivekvenkatesh

============================================
MapReduce program to analyze a Movie Dataset
============================================

There are 3 datafiles :: movies.dat, ratings.dat, users.dat.

Write an efficient map reduce program to do the following:

Q1. Find all the user ids who has rated at least n movies. (n=30 or 40) [use only ratings.dat as input file]

Q2. Given some movie titles - find all the genres of the movies. (Taking the movie titles as command line input is a must here) [use only movies.dat as input file]

Q3. Find the top 10 zipcodes based on the avarage age of users belong to that zipcode, in the descending order of the avarage age. Top 10 means the youngest 10 avarage age of users of that zipcode. (Use of Chaining of two map-reduce job is a must here.) [use only users.dat as input file]

JOINS
=====

Q4. Implement the query using map side join : Find the top 10 users (userID, age,gender) who has rated most number of movies in descending order of the counts.

Q5. List all the movies with its genre where the movie genre is Action or Drama and the average movie rating is in between 4.4-4.7 and only the male users rate the movie.(This is reduce side join).

INPUT TO THE MAP REDUCE CODE
============================

If executing through the submitted JAR file, you need not specify the Class Name

$hadoop jar MovieAnalyzer.jar <input_path> <output_path> <UserRating | MovieGenre> <n | movie_titles> 
	where UserRating displays the users who have rated at least 'n' movies 
		  MovieGenre displays the Genres of all movies specified in the movie_titles 
		  (IMPORTANT!!! : Please separate the movie tiles by :: and give them within "" in command line arguments)
		  Refer the example below. 

$hadoop jar MovieAnalyzer.jar <input_path> <output_path1> <TopTenZip> <output_path2>

	where output_path1 is the output path of the 1st job in the chaining
		  output_path2 is the output path of the 2nd job in the chaining	  

$hadoop jar MovieAnalyzer.jar <input path to ratings.dat> <output_path_1> RatingCount <input path to users.dat> <output_path_2> 

$hadoop jar MovieAnalyzer.jar <input path to users.dat> <input path to movies.dat> <input path to ratings.dat> MovieGenre <output_path1> <output_path2> 

Example:
-------
1. Find the user ids who have rated at least 30 Movies
------------------------------------------------------
$hadoop jar MovieAnalyzer.jar /Spring2014_HW-1/input_HW-1/ratings.dat /user/vxg120430/output UserRating 30

2. Find the Genres of the movie titles "Dancer in the Dark (2000)::Get Carter (2000)"
-------------------------------------------------------------------------------------
(Remember to use quotes in the command line argument. Because of the spaces in the input)

$hadoop jar MovieAnalyzer.jar /Spring2014_HW-1/input_HW-1/movies.dat /user/vxg120430/output1 MovieGenre "Dancer in the Dark (2000)::Get Carter (2000)"

3. Find the Top Ten Youngest age ZipCodes in descending order 
-------------------------------------------------------------
$hadoop jar MovieAnalyzer.jar /Spring2014_HW-1/input_HW-1/users.dat /user/vxg120430/output2 TopTenZip /user/vxg120430/output3

4. Map Side Join
----------------
$hadoop jar MovieJoinAnalyzer.jar /Spring2014_HW-1/input_HW-1/ratings.dat /user/vxg120430/output RatingCount /Spring2014_HW-1/input_HW-1/users.dat /user/vxg120430/output1

5. Reduce Side Join
-------------------
$hadoop jar MovieJoinAnalyzer.jar /Spring2014_HW-1/input_HW-1/users.dat /Spring2014_HW-1/input_HW-1/movies.dat /Spring2014_HW-1/input_HW-1/ratings.dat MovieGenre /user/vxg120430/output2 /user/vxg120430/output3
vivekvenkatesh/moviedata-statistics