Ran queries against documents from a US government website indexed with the Whoosh library. Improved on Whoosh's baseline performance by at least 30% using NLTK tools (e.g., adding a stop-word filter, an intra-word filter, a lowercase filter, and NLTK's stemmers and lemmatizers).
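A minimal sketch of the kind of token-filter chain described above (intra-word splitting, lowercasing, stop-word removal), written in plain Python as a stand-in rather than with the Whoosh or NLTK APIs. The stop list and the function names `split_intra_word` and `analyze` are hypothetical, and stemming and lemmatization are left out.

```python
import re

# Hypothetical stop list for illustration only; a real run would use a
# library-provided list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "for"}


def split_intra_word(token):
    """Split a token on hyphens, underscores, and lower-to-upper case changes."""
    parts = []
    for piece in re.split(r"[-_]", token):
        # Capitalized or lowercase words, all-caps runs, or digit runs.
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", piece))
    return parts


def analyze(text):
    """Tokenize, split intra-word boundaries, lowercase, and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9_-]+", text)
    tokens = [part for tok in tokens for part in split_intra_word(tok)]
    tokens = [tok.lower() for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(analyze("The Tax-Filing deadline for PowerShell users"))
```

Applying the same chain to both documents and queries lets a query such as "tax filing" match text that was originally written as "Tax-Filing".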
Classified news posts into 20 newsgroups using logistic regression and Naive Bayes (NB). Improved performance by varying the feature sets, feature encodings, amount of training data, and hyperparameters. Feature encodings included binary, TF, and TF-IDF. Tried both the softmax and one-vs-all approaches to multi-class classification.
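The three feature encodings can be illustrated with a small plain-Python sketch. The helper name `encode` is hypothetical, and the IDF shown is the unsmoothed variant log(N / df), one common choice that may differ from the one used in the project.

```python
import math
from collections import Counter


def encode(docs, scheme="tfidf"):
    """Encode tokenized documents as rows over a sorted vocabulary.

    scheme: "binary" (term present or not), "tf" (raw count),
    or "tfidf" (count times log(N / document frequency)).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for term in vocab:
            tf = counts[term]
            if scheme == "binary":
                row.append(1.0 if tf else 0.0)
            elif scheme == "tf":
                row.append(float(tf))
            elif scheme == "tfidf":
                row.append(tf * math.log(n_docs / df[term]))
            else:
                raise ValueError(f"unknown scheme: {scheme}")
        rows.append(row)
    return vocab, rows


if __name__ == "__main__":
    docs = [["space", "nasa", "space"], ["hockey", "nasa"]]
    for scheme in ("binary", "tf", "tfidf"):
        print(scheme, encode(docs, scheme))
```

Under TF-IDF, a term that appears in every document (here "nasa") gets weight zero, which is why this encoding tends to emphasize terms that separate one newsgroup from another.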
Built movie recommendation systems based on popularity, user average, collaborative filtering (user-user and item-item, with cosine, Euclidean, and Manhattan similarity), content-based filtering, and Match Box. Evaluated them using RMSE, precision at k (P@k), and recall at k (R@k). The best recommender, “User-Cosine”, achieved a mean RMSE of 1.017, a mean P@5 of 0.56, and a mean R@5 of 0.49.
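As a rough illustration of the similarity measure and evaluation metrics named above, here is a plain-Python sketch. The function names are hypothetical, ratings are assumed to be stored as dicts from item to rating, and cosine similarity is computed over co-rated items only, which is one of several conventions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision and recall of the top-k recommendations."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


if __name__ == "__main__":
    alice = {"m1": 5, "m2": 3, "m3": 4}
    bob = {"m1": 4, "m2": 2, "m4": 5}
    print(cosine_similarity(alice, bob))
    print(rmse([3.5, 4.0], [4.0, 4.0]))
    print(precision_recall_at_k(["a", "b", "c", "d", "e"], {"a", "c", "z"}, 5))
```

Lower RMSE means more accurate rating predictions, while P@k and R@k measure how many of the top-k recommendations are relevant and how many of the relevant items are recovered.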
Used the VADER library to perform sentiment analysis. Analyzed TripAdvisor reviews using frequency, mutual information, and pointwise mutual information of words and phrases.
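Pointwise mutual information for adjacent word pairs can be sketched as follows. This is an illustrative plain-Python version with a hypothetical function name `bigram_pmi`, using simple maximum-likelihood estimates without smoothing or frequency cutoffs; it is not necessarily how the project computed it.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return PMI(x, y) = log2(p(x, y) / (p(x) * p(y))) for adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = unigrams[x] / n_unigrams
        p_y = unigrams[y] / n_unigrams
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    review = ["great", "location", "great", "location", "clean", "room"]
    for pair, score in sorted(bigram_pmi(review).items(), key=lambda kv: -kv[1]):
        print(pair, round(score, 3))
```

PMI favors pairs whose words rarely occur apart, so on small samples it can rank rare pairs above frequent ones; combining it with raw frequency is a common way to temper that.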
Performed social network analysis on Twitter data.