Given (user, movie, rating) triples, predict the rating for unseen user-movie pairs. Explore different methods, such as Matrix Factorization (MF) and neural-network-based (DNN) methods, and compare the results.
UserID | Gender | Age | Occupation | Zip-code |
---|---|---|---|---|
796 | F | 1 | 10 | 48067 |
3203 | M | 56 | 16 | 70072 |
4387 | M | 25 | 15 | 55117 |
MovieID | Title | Genres |
---|---|---|
1 | Toy Story (1995) | Animation |
2 | Jumanji (1995) | Adventure |
3 | Grumpier Old Men (1995) | Comedy |
TrainDataID | UserID | MovieID | Rating |
---|---|---|---|
1 | 796 | 1193 | 5 |
2 | 796 | 661 | 3 |
3 | 796 | 914 | 3 |
4 | 796 | 3408 | 4 |
Baseline on Kaggle
- Strong baseline: RMSE 0.87389
- Simple baseline: RMSE 0.93104
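
For reference, the RMSE metric used for these baselines can be computed as in the minimal sketch below. The function name `rmse` and the constant-prediction example are illustrative assumptions; the ratings are the four training rows shown in the table above.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length rating sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Ratings of user 796 from the training table above.
true_ratings = [5, 3, 3, 4]

# Illustrative predictor: always predict the mean of those ratings.
mean_rating = sum(true_ratings) / len(true_ratings)
print(rmse(true_ratings, [mean_rating] * len(true_ratings)))
```

A lower RMSE is better, so a model has to score below 0.87389 to beat the strong baseline.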
Model Settings
Dimension | Learning rate | Epochs | RMSE |
---|---|---|---|
32 | 0.0003 | 175 | 0.73801263 |
64 | 0.0003 | 175 | 0.71441079 |
84 | 0.0003 | 175 | 0.7196014 |
128 | 0.0003 | 175 | 0.715795 |
Best MF result: RMSE = 0.71441079 (dimension 64)
Best DNN result: RMSE = 0.86614
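
To make the MF approach concrete, here is a minimal pure-Python sketch of matrix factorization trained with SGD. It is not the code behind the numbers above: the bias terms, the L2 regularization, the hyperparameters, and the function names (`train_mf`, `predict`, `train_rmse`) are all assumptions chosen for illustration. The ratings for user 796 come from the training table; the rows for users 1 and 2 are made-up toy data.

```python
import math
import random


def train_mf(ratings, n_factors=8, lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i with plain SGD (assumed model)."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    # Small random latent vectors; n_factors plays the role of "dimension".
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    data = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(data)
        for u, i, r in data:
            pu, qi = P[u], Q[i]
            pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(pu, qi))
            err = r - pred
            # Gradient steps on the regularized squared error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for k in range(n_factors):
                pk, qk = pu[k], qi[k]
                pu[k] += lr * (err * qk - reg * pk)
                qi[k] += lr * (err * pk - reg * qk)
    return mu, bu, bi, P, Q


def predict(model, u, i):
    """Predict a rating; unseen users or movies fall back to the known biases."""
    mu, bu, bi, P, Q = model
    dot = sum(a * b for a, b in zip(P[u], Q[i])) if u in P and i in Q else 0.0
    return mu + bu.get(u, 0.0) + bi.get(i, 0.0) + dot


def train_rmse(model, ratings):
    """RMSE of the model on a list of (user, movie, rating) triples."""
    se = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))


ratings = [
    # Rows for user 796 from the training table.
    (796, 1193, 5), (796, 661, 3), (796, 914, 3), (796, 3408, 4),
    # Made-up toy rows for hypothetical users 1 and 2.
    (1, 1193, 4), (1, 661, 2), (2, 914, 4), (2, 3408, 5), (2, 1193, 5),
]

untrained = train_mf(ratings, n_epochs=0)
trained = train_mf(ratings, n_epochs=200)
print("training RMSE before:", train_rmse(untrained, ratings))
print("training RMSE after: ", train_rmse(trained, ratings))
print("user 1 on movie 914: ", predict(trained, 1, 914))
```

The sketch only reports training error on a toy set; a real comparison would hold out a validation split and score that instead.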
Both methods beat the strong baseline (RMSE 0.87389). However, across the model settings tried, MF almost always outperformed the DNN. An RNN-based model may be worth trying next.
./movie_predict.sh $datadir $outputfile
t-SNE components of movie embeddings
- Red: Children's, Musical, Animation, Documentary, Comedy
- Green: War, Crime, Sci-Fi, Action, Adventure
- Blue: Drama, Romance
- Purple: Fantasy, Thriller, Horror
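
The color grouping above can be expressed as a lookup table when coloring the plotted points. The sketch below is an assumption about how that could be done, not the original plotting code: `movie_color` is a hypothetical helper, it colors a movie by its first listed genre, it assumes multiple genres are separated by `|`, and it falls back to gray for genres outside the four groups.

```python
# Genre groups and their plot colors, as listed above.
GENRE_GROUPS = {
    "red": ["Children's", "Musical", "Animation", "Documentary", "Comedy"],
    "green": ["War", "Crime", "Sci-Fi", "Action", "Adventure"],
    "blue": ["Drama", "Romance"],
    "purple": ["Fantasy", "Thriller", "Horror"],
}

# Invert the grouping into a genre -> color lookup.
GENRE_TO_COLOR = {
    genre: color for color, genres in GENRE_GROUPS.items() for genre in genres
}


def movie_color(genres, default="gray"):
    """Return the plot color for a movie based on its first listed genre.

    Assumes `genres` is a string such as "Animation" or "Action|Sci-Fi".
    """
    first_genre = genres.split("|")[0].strip()
    return GENRE_TO_COLOR.get(first_genre, default)


# Genres of the three movies shown in the movie table.
for title, genres in [
    ("Toy Story (1995)", "Animation"),
    ("Jumanji (1995)", "Adventure"),
    ("Grumpier Old Men (1995)", "Comedy"),
]:
    print(title, "->", movie_color(genres))
```

The resulting color strings can be passed per point to whatever plotting library draws the two t-SNE components.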