A collection of data science exercises
- Conducted an exploratory analysis of the relationships between phone name, brand, price, and rating in over 400,000 product reviews from Amazon.com.
- Trained a random forest classifier on 90,000 reviews to achieve an 85% F1-score predicting positive, negative, or neutral sentiment.
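A minimal sketch of this kind of three-class sentiment pipeline. The tiny corpus, vectorizer choice, and hyperparameters here are illustrative stand-ins, not the project's actual data or settings:

```python
# Hedged sketch: random-forest sentiment classification on text features.
# The reviews/labels below are hypothetical stand-ins for the Amazon corpus.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

reviews = ["great phone, love it", "terrible battery life", "it is okay",
           "best purchase ever", "broke after a week", "average device"] * 50
labels  = ["positive", "negative", "neutral",
           "positive", "negative", "neutral"] * 50

X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels)

# Text -> TF-IDF features -> random forest, wrapped in one pipeline.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=42))
model.fit(X_train, y_train)

# Weighted F1 accounts for the three-class setting.
score = f1_score(y_test, model.predict(X_test), average="weighted")
print(round(score, 2))
```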
- A handmade implementation of Logistic Regression using TensorFlow and NumPy.
- Trained the classifier on a toy moons dataset and visualized its predictions.
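A sketch of how such a hand-rolled TensorFlow logistic regression might look, trained by gradient descent on a moons dataset. Learning rate, step count, and variable names are assumptions for illustration:

```python
# Hedged sketch: logistic regression from TensorFlow primitives on toy moons.
import numpy as np
import tensorflow as tf
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
X = X.astype(np.float32)
y = y.astype(np.float32).reshape(-1, 1)

# Model parameters: a weight vector and a bias, updated by hand.
w = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(0.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.5)

for step in range(500):
    with tf.GradientTape() as tape:
        logits = tf.matmul(X, w) + b
        # Numerically stable log-loss on the raw logits.
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
    grads = tape.gradient(loss, [w, b])
    optimizer.apply_gradients(zip(grads, [w, b]))

# A linear boundary cannot fully separate the moons, so accuracy caps out
# below 100% even at convergence.
preds = (tf.sigmoid(tf.matmul(X, w) + b).numpy() > 0.5).astype(np.float32)
accuracy = float((preds == y).mean())
print(round(accuracy, 2))
```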
- A handmade implementation of Logistic Regression using NumPy.
- Implemented an early-stopping algorithm during training to prevent overfitting and visualized the training and validation set errors over gradient descent iterations.
- Compared the results of batch gradient descent vs. early stopping (virtually the same).
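The early-stopping idea above can be sketched in pure NumPy: run gradient descent, track validation loss, and roll back to the best weights once it stops improving. The toy dataset, learning rate, and patience value are illustrative assumptions:

```python
# Hedged sketch: NumPy logistic regression with a simple early-stopping rule
# (stop when validation loss fails to improve for `patience` consecutive steps).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: two Gaussian blobs for binary classification.
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
idx = rng.permutation(400)
X_train, y_train = X[idx[:300]], y[idx[:300]]
X_val, y_val = X[idx[300:]], y[idx[300:]]

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(Xm, ym, w, b):
    p = sigmoid(Xm @ w + b)
    return -np.mean(ym * np.log(p + 1e-12) + (1 - ym) * np.log(1 - p + 1e-12))

w, b = np.zeros(2), 0.0
lr, patience = 0.1, 20
best_val, wait = np.inf, 0
best_w, best_b = w.copy(), b

for epoch in range(2000):
    # One full-batch gradient descent step.
    p = sigmoid(X_train @ w + b)
    w -= lr * (X_train.T @ (p - y_train)) / len(y_train)
    b -= lr * np.mean(p - y_train)

    # Early stopping: monitor validation loss, keep the best weights seen.
    val = log_loss(X_val, y_val, w, b)
    if val < best_val:
        best_val, best_w, best_b, wait = val, w.copy(), b, 0
    else:
        wait += 1
        if wait >= patience:
            break  # roll back to the best checkpoint below

w, b = best_w, best_b
val_acc = float(((sigmoid(X_val @ w + b) > 0.5).astype(int) == y_val).mean())
print(round(val_acc, 2))
```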
- Created a text classifier to differentiate spam from ham (i.e. legitimate) emails in the Apache SpamAssassin dataset.
- Used the 'Bag of Words' method of feature extraction to create a matrix of word frequencies.
- Scored an accuracy of 99% on the testing set using a Support Vector Machine classifier.
- Examined the feature importances of a random forest classifier and found that the strongest ham indicator in the dataset was the presence of the IMAP mail protocol in the "received" header field.