-
Text mining with Amazon’s book review data
• Scraped Amazon’s book prices and reviews. Used the tidytext package in R to remove stop words and tokenize documents. Created word clouds and network graphs to display the most frequent words and the connections between words in different types of reviews.
• Performed sentiment analysis using Google’s Natural Language API. Built a non-linear model to predict ratings for new, unrated books, with an MSE of 0.5.
-
Population clustering using CDC’s 2016 Annual Survey Data
• Used ggplot for data visualization and discovered prevalent epidemic diseases across states.
• Performed PCA for dimensionality reduction and applied the K-means algorithm for clustering. Identified population clusters with distinct health conditions and found correlations between behaviors and chronic diseases.
-
Bad Loan Prediction with Lending Club’s data
• Used R for data cleaning, missing-data imputation and data transformation. With the H2O package, applied Neural Network, Random Forest, Naïve Bayes and other algorithms to predict bad loans.
• Calculated and compared variable importance for each model; the best model achieved an accuracy of 67.8%.
-
Clustering with SVM and K-means
• Used SVM and K-means algorithms to create clusters with the iris dataset.
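
As a rough illustration of the K-means part of this project, the sketch below runs plain Lloyd's algorithm on a nine-row sample in the style of the iris data (sepal length, sepal width, petal length, petal width). It is written in Python with the standard library only and is not the project's original code. The function name `kmeans`, the `SAMPLE` rows, and the evenly spaced starting centroids are assumptions made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Illustrative iris-style rows: (sepal length, sepal width, petal length, petal width).
SAMPLE = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),
    (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5),
    (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),
]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic start (assumption for this demo).

    Centroids start at k evenly spaced input points so the run is repeatable;
    real analyses usually use random or k-means++ initialization instead.
    """
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


labels, centroids = kmeans(SAMPLE, k=3)
print(labels)
```

On this small sample the three groups of rows end up in three separate clusters. On the full dataset the result depends on initialization and on how the features are scaled.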