TangHan54/Amazon_Products_Recommender_System

Uses PySpark to build content-based and collaborative filtering recommenders.


Amazon_Products_Recommender_System

CS5344 Big Data Project.

Data Source: Amazon product data

Put the data files in the 'data' folder.

Raw data files should be in '.json.gz' or '.json' format.
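As a rough illustration of reading either file type, here is a minimal plain-Python sketch. It assumes each line of the file holds one JSON object; the helper name `read_reviews` is hypothetical and not part of this repository.

```python
import gzip
import json
from pathlib import Path


def read_reviews(path):
    """Yield one review dict per non-empty line of a .json or .json.gz file.

    Assumption: the file is line-delimited JSON (one object per line).
    """
    path = Path(path)
    # Choose the opener from the file suffix; both open the file in text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, a large file can be streamed without loading it all into memory.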

Create a virtual environment

mkvirtualenv spark
workon spark
pip install -r requirements.txt

Process data

Data processing includes

combine multiple files
remove duplicates
remove users who posted fewer than n reviews (default: 5)
remove products that received fewer than n reviews
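The filtering steps above can be sketched in plain Python as follows. This is only an illustration of the logic, not the project's PySpark code. It assumes reviews are dicts with `reviewerID` and `asin` keys, treats a repeated (user, product) pair as a duplicate, and uses a product threshold of 5, which is a guess since the README gives a default only for users. The function name `filter_reviews` is hypothetical.

```python
from collections import Counter


def filter_reviews(reviews, min_user_reviews=5, min_product_reviews=5):
    """Drop duplicate reviews, then low-activity users and products."""
    # Keep the first review seen for each (user, product) pair (assumed key).
    seen = set()
    unique = []
    for review in reviews:
        key = (review["reviewerID"], review["asin"])
        if key not in seen:
            seen.add(key)
            unique.append(review)

    # Remove users with too few reviews.
    user_counts = Counter(r["reviewerID"] for r in unique)
    kept = [r for r in unique if user_counts[r["reviewerID"]] >= min_user_reviews]

    # Remove products with too few reviews among the remaining rows.
    product_counts = Counter(r["asin"] for r in kept)
    return [r for r in kept if product_counts[r["asin"]] >= min_product_reviews]
```

This is a single pass: removing products can push some users back below the user threshold, so a stricter filter would repeat the two steps until nothing changes.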

Build models

Content-based model

Calculate the TF-IDF of reviewText
Calculate the pairwise similarity between products
Given a product, recommend the most similar products
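The three steps above can be sketched in plain Python to show the idea. This is not the project's PySpark implementation. It assumes whitespace tokenization, a smoothed IDF of log((N + 1) / (df + 1)), and cosine similarity as the similarity measure; the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each product ID to a sparse {term: tf-idf weight} dict."""
    tokens = {pid: text.lower().split() for pid, text in docs.items()}
    # Document frequency: number of products whose text contains the term.
    df = Counter(term for words in tokens.values() for term in set(words))
    n_docs = len(docs)
    vectors = {}
    for pid, words in tokens.items():
        tf = Counter(words)
        # Smoothed IDF; a term present in every document gets weight 0.
        vectors[pid] = {
            term: count * math.log((n_docs + 1) / (df[term] + 1))
            for term, count in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(product_id, vectors, top_n=5):
    """Return the top_n (product, similarity) pairs for the given product."""
    target = vectors[product_id]
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != product_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Comparing every pair of products is quadratic in the number of products, which is one reason a distributed engine such as Spark is useful at the scale of the full dataset.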

Collaborative Filtering

Matrix factorization based on explicit ratings
Given a user or a product, recommend the ones with the highest predicted ratings.
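To illustrate what matrix factorization on explicit ratings does, here is a small plain-Python sketch trained with stochastic gradient descent. It is a stand-in for illustration only: the README does not say which solver the project uses, and the function names and hyperparameters below are hypothetical.

```python
import random


def factorize(ratings, n_factors=2, epochs=200, lr=0.02, reg=0.02, seed=0):
    """Learn user and product factor vectors from (user, product, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    products = sorted({p for _, p, _ in ratings})
    # Small random initial factors.
    user_f = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    prod_f = {p: [rng.gauss(0, 0.1) for _ in range(n_factors)] for p in products}
    for _ in range(epochs):
        for u, p, r in ratings:
            pu, qp = user_f[u], prod_f[p]
            # Error between the observed rating and the current dot product.
            err = r - sum(a * b for a, b in zip(pu, qp))
            for k in range(n_factors):
                pu_k, qp_k = pu[k], qp[k]
                # Gradient step with L2 regularization on both factors.
                pu[k] += lr * (err * qp_k - reg * pu_k)
                qp[k] += lr * (err * pu_k - reg * qp_k)
    return user_f, prod_f


def predict(user_f, prod_f, user, product):
    """Predicted rating: dot product of the user and product factors."""
    return sum(a * b for a, b in zip(user_f[user], prod_f[product]))


def recommend_for_user(user, ratings, user_f, prod_f, top_n=5):
    """Rank products the user has not rated by predicted rating."""
    rated = {p for u, p, _ in ratings if u == user}
    scores = [
        (p, predict(user_f, prod_f, user, p)) for p in prod_f if p not in rated
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Recommending users for a given product works the same way with the roles swapped: score every user who has not rated the product and keep the highest predictions.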

spark-submit --master "local[8]" manage.py

or

python manage.py