In this project, I applied a few natural language processing algorithms to identify which tags that users of movielens--a movie recommendation website--applied best characterize these movies.
Movielens is a movie-recommendation website mainted by GroupLens, University of Minnesota. Data gathered through this site has facilitated numerous studies in Recommender Systems.
My approach is:
- Remove duplicate tags (remove capitalization and punctuations)
- Identify similar tags with the same stem (stemming) and keep only the one with highest quality
- Train word2vec model on IMDB movie reviews, and use the model to identify tags with similar syntactic meanings. Filter out low-quality ones and keep only the one with highest quality.
- I generate a new table that keep track of only high quality tags, write a python script to update the tag data in Movielens' database.