This example was taken from the following AWS guide. This is a movie recommendation application based on the user ratings.
https://aws.amazon.com/blogs/big-data/building-a-recommender-with-apache-mahout-on-amazon-elastic-mapreduce-emr/
Follow these steps,
-
Start EMR cluster with hadoop/ mahout. Use the default configurations on AWS EMR which is fine.
-
Connect to the Master node via a SSH terminal.
-
Get the movie rating data from following link.
wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
unzip ml-1m.zip -
Convert ratings.dat to ratings.csv by running following command.
cat ml-1m/ratings.dat | sed 's/::/,/g' | cut -f1-3 -d, > ratings.csv
-
Put the data (ratings.csv) into HDFS cluster.
hadoop fs -put ratings.csv /ratings.csv
-
Use mahout recommenderitembased with the data on the cluster.
mahout recommenditembased --input /ratings.csv --output recommendations --numRecommendations 10 --outputPathForSimilarityMatrix similarity-matrix --similarityClassname SIMILARITY_COSINE
-
Use following commands to see the recommendations generated for each of the user.
hadoop fs -ls recommendations
hadoop fs -cat recommendations/part-r-00000 | head -
Install python libraries needed to run the web service.
sudo easy_install twisted
sudo easy_install klein
sudo easy_install redis -
Install and run redis service on the master node.
wget http://download.redis.io/releases/redis-2.8.7.tar.gz
tar xzf redis-2.8.7.tar.gz
cd redis-2.8.7
make
./src/redis-server & -
Copy the hello.py provided in this repo to this folder.
-
Start the web service.
twistd -noy hello.py &
-
Use following command to request the endpoint.
curl localhost:8090/37
-
Make sure to terminate the EMR cluster after everything is done. This will return the recommended movies with their ratings for user id 37