CS5344 Big Data Project.
Data Source: Amazon product data
Place the data in the 'data' folder.
- Raw data files should be in '.json.gz' or '.json' format.
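A minimal sketch of reading both file formats, assuming each file holds one JSON object per line (the usual layout of the Amazon review dumps). The function names read_reviews and read_folder are hypothetical and are not taken from manage.py.

```python
import gzip
import json
from pathlib import Path


def read_reviews(path):
    """Yield one dict per non-empty line of a '.json' or '.json.gz' file."""
    path = Path(path)
    # gzip.open in text mode decompresses '.json.gz' files on the fly.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                # Assumption: strict JSON, one object per line.
                yield json.loads(line)


def read_folder(folder="data"):
    """Yield records from every '.json' and '.json.gz' file in the folder."""
    for path in sorted(Path(folder).iterdir()):
        if path.name.endswith((".json", ".json.gz")):
            yield from read_reviews(path)
```

Calling list(read_folder("data")) would load every record into memory, so for large dumps it is probably better to iterate lazily.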
- Create a virtual environment
mkvirtualenv spark
workon spark
pip install -r requirements.txt
- Process data
Data processing includes
- combine multiple files
- remove duplicates
- remove users who posted fewer than n reviews (default: n = 5).
- remove products that received fewer than n reviews.
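The deduplication and filtering steps above could look roughly like this in plain Python. This is a sketch, not the project's Spark code: it assumes each review is a dict with 'reviewerID' and 'asin' keys (the Amazon dataset's field names), and it assumes a duplicate means a repeated user/product pair. The helper names are hypothetical.

```python
from collections import Counter


def drop_duplicates(reviews):
    """Keep the first review seen for each (reviewerID, asin) pair."""
    seen = set()
    unique = []
    for review in reviews:
        key = (review["reviewerID"], review["asin"])
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


def filter_by_count(reviews, min_reviews=5):
    """Drop users, then products, with fewer than min_reviews reviews."""
    user_counts = Counter(r["reviewerID"] for r in reviews)
    kept = [r for r in reviews if user_counts[r["reviewerID"]] >= min_reviews]
    # Product counts are taken after the user filter, in a single pass.
    product_counts = Counter(r["asin"] for r in kept)
    return [r for r in kept if product_counts[r["asin"]] >= min_reviews]
```

Because removing products can push some users back under the threshold, a stricter version would repeat filter_by_count until the list stops shrinking.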
- Build models
Content-based model
- Calculate the TF-IDF of reviewText
- Calculate the pairwise similarity between products
- Given a product, recommend the most similar products
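The three content-based steps can be sketched in plain Python as below. The assumptions here are not from the project: each product's reviewText is concatenated into one string, tokens are split on whitespace, the idf is smoothed, and similarity is cosine similarity. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each product id to a sparse {term: tf-idf weight} dict."""
    tokens = {pid: text.lower().split() for pid, text in docs.items()}
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    n_docs = len(docs)
    # Smoothed idf, log((1 + N) / (1 + df)) + 1, keeps every weight positive.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return {
        pid: {term: (count / len(words)) * idf[term] for term, count in Counter(words).items()}
        for pid, words in tokens.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(product_id, vectors, top_n=5):
    """Return the top_n (product, similarity) pairs for the given product."""
    scores = [
        (other, cosine(vectors[product_id], vec))
        for other, vec in vectors.items()
        if other != product_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Comparing every pair of products is quadratic in the number of products, so on the full dataset this step is the one most likely to need Spark or an approximate method.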
Collaborative Filtering
- Matrix factorization based on explicit ratings
- Given a user (or product), recommend the products (or users) with the highest predicted ratings.
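As an illustration of matrix factorization on explicit ratings, here is a small plain-Python sketch trained with stochastic gradient descent. The project presumably relies on Spark for this step, and the source does not say which algorithm it uses, so SGD is a swapped-in technique here. All names and hyperparameters are hypothetical.

```python
import random


def factorize(ratings, n_factors=2, epochs=200, lr=0.02, reg=0.02, seed=0):
    """Fit user and product factor vectors to (user, product, rating) triples."""
    rng = random.Random(seed)
    # Sort the ids so the random initialisation is reproducible.
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for user, item, rating in ratings:
            pu, qi = P[user], Q[item]
            err = rating - sum(a * b for a, b in zip(pu, qi))
            for k in range(n_factors):
                pu_k, qi_k = pu[k], qi[k]
                # Gradient step on squared error with L2 regularisation.
                pu[k] += lr * (err * qi_k - reg * pu_k)
                qi[k] += lr * (err * pu_k - reg * qi_k)
    return P, Q


def predict(P, Q, user, item):
    """Predicted rating is the dot product of the two factor vectors."""
    return sum(a * b for a, b in zip(P[user], Q[item]))


def recommend(P, Q, user, rated, top_n=3):
    """Rank products the user has not rated by predicted rating."""
    candidates = [item for item in Q if item not in rated]
    return sorted(candidates, key=lambda item: predict(P, Q, user, item), reverse=True)[:top_n]
```

Recommending users for a given product works the same way with the roles of P and Q swapped.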
spark-submit --master "local[8]" manage.py
or
python manage.py