Springboard Data Science Mini-Projects

Welcome! This repository contains my data science mini-projects, ranging from data wrangling and statistical inference to machine learning and advanced data visualization.

Data Wrangling

JSON Exercises: The World Bank projects dataset is provided as a JSON file. I first load the data into a Pandas dataframe and find that China, Indonesia and Vietnam have the most projects with the World Bank. I then load the JSON file as a string, normalize the nested project themes, and find that environment and natural resources management, rural development and human development are the project themes with the highest frequencies.
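A minimal sketch of the two steps, assuming the file is named world_bank_projects.json and uses the countryname and mjtheme_namecode fields of the World Bank dataset:

```python
import json
import pandas as pd

# Load the JSON file into a dataframe and count projects per country
df = pd.read_json('world_bank_projects.json')
print(df['countryname'].value_counts().head(3))

# Load the file as a string, then flatten the nested project themes
with open('world_bank_projects.json') as f:
    data = json.load(f)
themes = pd.json_normalize(data, 'mjtheme_namecode')
print(themes['name'].value_counts().head(3))
```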

XML Exercises: The Mondial database contains geographical and demographic information about countries, cities and physical features of the world. I use the xml.etree.ElementTree module to parse the data, which is stored as elements in a hierarchical structure. Each element has a tag, a number of attributes, a text string, and a number of child elements. I find that (1) Monaco, Japan and Bermuda have the lowest infant mortality rates; (2) Shanghai, Istanbul and Mumbai have the largest city populations; (3) Han Chinese, Europeans and Indo-Aryan are the ethnic groups with the largest overall population; and (4) the longest river in the world is the Amazonas, the largest lake is the Caspian Sea, and the airport at the highest elevation is El Alto International.
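A sketch of the parsing pattern for one of these questions, assuming the file is named mondial.xml and that each country element has name and infant_mortality children, as in the Mondial XML:

```python
import xml.etree.ElementTree as ET

tree = ET.parse('mondial.xml')
root = tree.getroot()

# Collect (country, infant mortality) pairs; skip countries missing the element
rates = []
for country in root.iterfind('country'):
    name = country.findtext('name')
    mortality = country.findtext('infant_mortality')
    if mortality is not None:
        rates.append((name, float(mortality)))

# Three countries with the lowest infant mortality rate
print(sorted(rates, key=lambda pair: pair[1])[:3])
```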

SQL with Mode Analytics: In this report, I investigate the possible causes of a drop in user engagement at Yammer, a social network for coworkers to share documents, updates and ideas in groups. The time series of daily active users shows a weekly cycle, with more users on weekdays than on weekends. The striking point is that, starting in early August 2014, the number of active users on the peak day of the week drops from about 430 to 380. This drop cannot be caused by a lack of new signups, because daily signups remain fairly constant at around 90 on the peak day. When I break down the active users by length of usage, the drop in engagement shows up only among experienced users who had been using Yammer for at least 12 weeks; the engagement level of newer users stays fairly constant. Finally, I find that the most likely root cause lies with email actions: the email open rate remains at around 30% for these experienced users, but the email clickthrough rate drops from 40% to almost 20%. Thus, the drop in engagement reflects not a deficiency in Yammer's social network platform but a lack of interest in the companies' weekly digest emails.

Exploratory Data Analysis

Human Body Temperature: I use a dataset of human body temperatures to illustrate the Central Limit Theorem, one-sample and two-sample hypothesis testing, and confidence intervals.
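A minimal sketch of a one-sample test and confidence interval with scipy, using a small made-up sample in place of the actual dataset:

```python
import numpy as np
from scipy import stats

# Made-up temperatures standing in for the real sample of observations
temps = np.array([98.2, 97.9, 98.6, 98.0, 99.1, 98.4, 97.5, 98.8])

# One-sample t-test against the conventional population mean of 98.6 F
t_stat, p_value = stats.ttest_1samp(temps, popmean=98.6)

# 95% confidence interval for the mean based on the t distribution
ci = stats.t.interval(0.95, df=len(temps) - 1,
                      loc=temps.mean(), scale=stats.sem(temps))
print(t_stat, p_value, ci)
```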

Racial Discrimination in US Job Market: The dataset comes from a field experiment in which researchers randomly assigned identical resumes with black-sounding or white-sounding names and observed the impact on requests for interviews from employers. Resumes with black-sounding names receive a callback rate of 6.4%, while those with white-sounding names receive a callback rate of 9.7%. This difference of 3.3 percentage points is statistically significant, as the p-value for the test of equality of callback rates is less than 0.001. Moreover, the 99% confidence interval suggests that the true difference in callback rates could range from 1.2 to 5.2 percentage points. Therefore, racial discrimination still appears to be a persistent challenge in the U.S. labor market.
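A sketch of the two-proportion z-test behind these numbers; the group sizes of 2,435 resumes each are an assumption for illustration:

```python
import numpy as np
from scipy import stats

# Assumed group sizes; callbacks implied by the 6.4% and 9.7% rates above
n_b, n_w = 2435, 2435
call_b, call_w = 0.064 * n_b, 0.097 * n_w

p_b, p_w = call_b / n_b, call_w / n_w
p_pool = (call_b + call_w) / (n_b + n_w)

# Two-proportion z-test for equality of callback rates
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_b + 1 / n_w))
z = (p_w - p_b) / se_pool
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# 99% confidence interval for the difference in callback rates
se_diff = np.sqrt(p_b * (1 - p_b) / n_b + p_w * (1 - p_w) / n_w)
margin = stats.norm.ppf(0.995) * se_diff
print(z, p_value, (p_w - p_b - margin, p_w - p_b + margin))
```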

Recommendations for Reducing Hospital Readmissions: Hospital readmissions have been used as indicators of poor quality of care, such as inadequate discharge planning and care coordination. The goal of the Hospital Readmissions Reduction Program is to reduce such unnecessary and avoidable readmissions. The initial report suggests that hospitals with a smaller number of discharges tend to have a higher excess readmission ratio, but it does not report the correlation coefficient or test whether the correlation is statistically significant. I improve the analysis by finding that the Pearson correlation coefficient is -0.09, indicating a negative but small correlation between the number of discharges and the excess readmission ratio. The p-value of the correlation test is less than 1%, so this negative relationship is indeed statistically significant. However, since the correlation is so small, it is not reasonable to assume that hospitals with a smaller number of discharges will always have a higher excess readmission ratio. As a result, I do not recommend that hospitals with smaller capacity be required to upgrade their resources or facilities.
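The correlation and its significance test can be computed with scipy; the arrays below are synthetic stand-ins for the discharges and excess readmission ratio columns in the hospital data:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two columns of interest
rng = np.random.default_rng(0)
discharges = rng.integers(100, 2000, size=500)
excess_ratio = 1.0 - 0.00005 * discharges + rng.normal(0, 0.1, size=500)

# Pearson correlation coefficient and the p-value of the test that it is zero
r, p_value = stats.pearsonr(discharges, excess_ratio)
print(r, p_value)
```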

Machine Learning

Linear Regression with Boston Housing Dataset: I use the scikit-learn library to build a linear regression model that predicts housing prices in Boston. The features include the per capita crime rate, the average number of rooms per dwelling, and the pupil-teacher ratio by town. I split the data into training and testing sets in order to measure how well a model built on the training set predicts the 'unseen' data in the test set. I also show how multiple rounds of cross-validation performed on different partitions help limit the problem of overfitting to a particular training subset and thus reduce the variability of the model.
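A sketch of the split-and-cross-validate workflow; a synthetic regression dataset stands in for the Boston data, which has been removed from recent scikit-learn releases:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in with the same shape as the Boston data (506 rows, 13 features)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)

# Hold out a test set to measure performance on 'unseen' data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print('test R^2:', model.score(X_test, y_test))

# 5-fold cross-validation on the training set to gauge variability across partitions
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5, scoring='r2')
print('CV R^2: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```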

Classification and Logistic Regression: I use cross-validation and grid search to find the best regularization parameter C for logistic regression. Regularization penalizes large coefficient estimates in order to reduce overfitting. The regularization parameter C in scikit-learn is the inverse of the shrinkage parameter lambda: a larger lambda, or a smaller C, increases the shrinkage penalty and shrinks the coefficient estimates toward zero. By default scikit-learn sets C=1 in logistic regression, so some amount of regularization is applied even if C is not specified. Regularization reduces the variance of the predictions but increases the bias at the same time. GridSearchCV performs a cross-validated grid search over a parameter grid: we specify an estimator, the parameter values to search over, and a scoring method, and the results give the best estimator, its score, and the parameter setting that yields the best score.
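A sketch of the grid search over C; the breast cancer dataset bundled with scikit-learn is used here as a stand-in for the data in the notebook:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Any binary classification dataset works here; breast cancer is just convenient
X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Smaller C means stronger regularization (C is the inverse of lambda)
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```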

Text Classification with Naive Bayes: I analyze movie reviews from the Rotten Tomatoes database. The goal is to train a classifier to predict whether a critic's movie review is 'fresh' or 'rotten.' To preprocess the text, CountVectorizer allows us to convert the collection of movie reviews into a matrix of token counts. The parameter min_df is used to remove terms that are too rare, and max_df is used to remove terms that are too common. I then train a multinomial Naive Bayes classifier, which assumes that the features are conditionally independent given the class. In Naive Bayes, alpha is an additive (Laplace/Lidstone) smoothing parameter: a larger alpha reduces the variance of the model (and overfitting) but increases the bias at the same time. We can think of alpha as a pseudocount of the number of times a word has been seen. In the notebook, I use grid search to find the best alpha as well as the best min_df that maximizes the probability of observing the training data.
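A minimal sketch of the vectorize-then-classify pipeline with a joint grid search over alpha and min_df, using a tiny made-up corpus in place of the actual reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus; the notebook uses the full set of critics' reviews
reviews = ["a perfect and touching masterpiece",
           "touching, perfect, a true masterpiece",
           "perfect casting and a touching story",
           "unfortunately dull and predictable",
           "the worst film of the year, dull throughout",
           "unfortunately the worst and most dull film"]
fresh = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([('vect', CountVectorizer()), ('nb', MultinomialNB())])

# Search jointly over the smoothing parameter alpha and the rare-term cutoff min_df
param_grid = {'nb__alpha': [0.1, 1.0, 5.0, 10.0], 'vect__min_df': [1, 2]}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(reviews, fresh)
print(grid.best_params_, grid.best_score_)
```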

For feature selection, we can create an identity matrix whose size equals the number of features/words, with each row representing exactly one feature/word. We then use each single word to predict the probability that a review containing only this word is fresh or rotten. If a single word alone yields a high probability of a review being fresh or rotten, that feature has high predictive power. Reviews containing words such as perfect, touching and masterpiece are likely to be fresh, while words like unfortunately, dull and worst tend to predict rotten reviews.
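A sketch of the identity-matrix trick, reusing the tiny illustrative corpus from above rather than the real review data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Same tiny illustrative corpus as in the previous sketch
reviews = ["a perfect and touching masterpiece",
           "touching, perfect, a true masterpiece",
           "perfect casting and a touching story",
           "unfortunately dull and predictable",
           "the worst film of the year, dull throughout",
           "unfortunately the worst and most dull film"]
fresh = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
clf = MultinomialNB().fit(X, fresh)

# Each row of the identity matrix is a pseudo-document containing exactly one word
words = np.array(vectorizer.get_feature_names_out())
prob_fresh = clf.predict_proba(np.eye(len(words)))[:, 1]

# Words whose presence alone most strongly predicts a rotten or fresh review
order = np.argsort(prob_fresh)
print('most rotten:', words[order[:5]])
print('most fresh:', words[order[-5:]])
```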

I also improve the model in several ways: (1) including both unigrams and bigrams to capture two-word phrases; (2) vectorizing the reviews with Term Frequency-Inverse Document Frequency (TF-IDF) weights; and (3) training a Random Forest classifier with the number of trees chosen by cross-validation.
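A compact sketch of those variations, again on the tiny illustrative corpus rather than the real data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Same tiny illustrative corpus as in the previous sketches
reviews = ["a perfect and touching masterpiece",
           "touching, perfect, a true masterpiece",
           "perfect casting and a touching story",
           "unfortunately dull and predictable",
           "the worst film of the year, dull throughout",
           "unfortunately the worst and most dull film"]
fresh = [1, 1, 1, 0, 0, 0]

# TF-IDF weights over unigrams and bigrams, followed by a Random Forest
pipe = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                 ('rf', RandomForestClassifier(random_state=0))])

# Choose the number of trees by cross-validation
grid = GridSearchCV(pipe, {'rf__n_estimators': [50, 100, 200]}, cv=2)
grid.fit(reviews, fresh)
print(grid.best_params_)
```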

Customer Segmentation using Clustering: The dataset contains wine offers that were e-mailed to customers and data on which offers they purchased. Important features of a wine offer include the varietal, the minimum quantity, the discount, the country of origin and whether or not it is past peak. I merge two spreadsheets to create a pandas dataframe with each row representing a customer and each column representing a wine offer. I first apply K-Means clustering and use both the elbow method and the silhouette method to choose the number of clusters. To visualize the clusters, I use Principal Component Analysis to reduce the 32 features to two dimensions. I also compare results from other clustering algorithms: affinity propagation, spectral clustering, agglomerative clustering, and DBSCAN.
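A sketch of the K-Means selection and PCA projection, with a random 0/1 purchase matrix standing in for the real customer-by-offer dataframe:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Synthetic purchase matrix: 100 customers by 32 offers
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 32)).astype(float)

# Elbow method (inertia) and silhouette score across candidate cluster counts
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))

# Project the 32 offer features onto two principal components for plotting
coords = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(coords[:3], labels[:10])
```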

For this dataset, agglomerative clustering performs well in identifying a group who favor Pinot Noir, a group who tend to choose offers requiring a minimum quantity of only six, and two other groups who tend to purchase large quantities of Champagne (one also purchases Chardonnay, the other Espumante and Prosecco). DBSCAN does not perform well because the sparse dataset of 32 features makes it difficult to find clusters; as a result, more than half of the data points are classified as noise. The clustering result from affinity propagation is also not very convincing, since some of the clusters overlap in the two-dimensional PCA feature space. Spectral clustering is capable of identifying the Pinot Noir group, the small-offers group and the Champagne group, but the fourth group does not have a clear pattern. Overall, what is consistent across the different clustering methods is that some customers purchase Pinot Noir almost exclusively, some focus only on small offers regardless of the wine varietal, and the rest tend to buy Champagne in bulk.

Advanced Machine Learning Topics

Building a Recommendation Engine:

Time Series Analysis: pandas can generate series of timestamps as well as time periods using date_range() and period_range(). A time series has a useful .resample() method, which performs a time-based groupby followed by a reduction method (e.g. mean(), var(), or sum()) on each of the groups. For example, one can resample an hourly time series to a daily time series and take the average hourly rate for each day. There is also a .rolling() window method, which aggregates the data within a window of fixed size at each point, again followed by a reduction method. The .expanding() window method is cumulative, meaning that the window size grows along with the time series; this is useful when all historical data are just as important as the recent data. There are also some exercises based on real stock time series data.
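A quick sketch of these operations on a synthetic hourly series standing in for the stock data:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series over two weeks
idx = pd.date_range('2021-01-01', periods=24 * 14, freq='h')
ts = pd.Series(np.random.default_rng(0).normal(100, 5, len(idx)), index=idx)

# period_range generates time periods rather than timestamps
periods = pd.period_range('2021-01', periods=3, freq='M')

# Resample: a time-based groupby followed by a reduction on each group
daily_mean = ts.resample('D').mean()

# Rolling window of fixed size vs. an expanding (cumulative) window
rolling_24h = ts.rolling(window=24).mean()
running_mean = ts.expanding().mean()

print(periods, daily_mean.head(3), rolling_24h.tail(3), running_mean.tail(3))
```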

Before we can build a time series model for forecasting, we need to examine the characteristics of the series, such as trend and seasonality. One key assumption of the ARIMA model is that the time series must be stationary, meaning that it has a constant mean and constant variance and that its autocovariance does not depend on time. A common way to detrend a time series is to take the first difference, and taking a logarithm transformation can stabilize the variance. We can check whether a time series has a unit root (and is therefore non-stationary) by conducting an ADF test. Once the transformed series is likely to satisfy the assumption of stationarity, we can examine the ACF and PACF to determine the initial choice of the AR(p) and MA(q) model orders and the seasonal AR(P) and MA(Q) orders. A good practice is to fit many models and select the best one based on a variety of metrics such as AIC, BIC, statistical tests on the residuals, and out-of-sample forecast error.
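A sketch of the stationarity check and model fitting with statsmodels, on a synthetic trending series rather than the data analyzed in the notebook:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, adfuller, pacf

# Synthetic monthly series with a trend, so the level should look non-stationary
idx = pd.date_range('2015-01-01', periods=120, freq='MS')
y = pd.Series(np.cumsum(np.random.default_rng(0).normal(0.5, 1.0, 120)), index=idx)

# ADF test on the level and on the first difference (null: the series has a unit root)
print('level p-value:', adfuller(y)[1])
print('diff  p-value:', adfuller(y.diff().dropna())[1])

# ACF and PACF of the differenced series guide the initial choice of p and q
print(acf(y.diff().dropna(), nlags=12))
print(pacf(y.diff().dropna(), nlags=12))

# Fit one candidate ARIMA(p, d, q) model and compare AIC/BIC across candidates
model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.aic, model.bic)
```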