Project implementations for the Udacity Machine Learning Nanodegree. These projects cover different aspects of machine learning, including Supervised Learning, Unsupervised Learning, Reinforcement Learning, and Model Evaluation & Validation.
Several Python data analysis packages are used in the project implementations.
Numpy: Performs numerical operations.
Pandas: Data I/O, manipulation, and visualization.
Matplotlib, Seaborn: Data visualization.
scikit-learn: Builds, trains, and tests machine learning models.
An introductory project to machine learning, exploring variables that can be used to predict the survival of Titanic passengers, including socio-economic class, gender, age, and fare. The results imply that gender, age, and socio-economic class are important variables for prediction.
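An exploration like this typically starts by grouping the data on a candidate variable and comparing survival rates. A minimal sketch, using a tiny hypothetical sample in place of the real Titanic dataset (column names follow the common Kaggle schema):

```python
import pandas as pd

# Hypothetical mini-sample standing in for the Titanic dataset
data = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Pclass": [1, 3, 2, 3, 1, 3],
    "Survived": [1, 0, 1, 0, 1, 1],
})

# Survival rate grouped by one candidate predictor variable
survival_by_sex = data.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

The same `groupby` pattern extends to `Pclass` or binned `Age` to compare the predictive value of each variable.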
Supervised Learning. The goal of this project is Finding Donors for Charity.
Data Preprocessing
Log transformation for skewed continuous variables
Data normalization for numerical variables (MinMaxScaler)
One-hot encoding for categorical variables (pandas.get_dummies)
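The three preprocessing steps above can be sketched as follows, on a small hypothetical frame standing in for the project's census data (the real feature names may differ):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame mimicking the census data used in the project
df = pd.DataFrame({
    "capital-gain": [0.0, 15000.0, 0.0, 500.0],          # skewed continuous
    "age": [25, 47, 33, 52],                             # numerical
    "workclass": ["Private", "Gov", "Private", "Self"],  # categorical
})

# 1. Log transformation for the skewed feature (log1p handles zeros)
df["capital-gain"] = np.log1p(df["capital-gain"])

# 2. Min-max scaling for the numerical features
scaler = MinMaxScaler()
df[["capital-gain", "age"]] = scaler.fit_transform(df[["capital-gain", "age"]])

# 3. One-hot encoding for the categorical feature
df = pd.get_dummies(df, columns=["workclass"])
print(df.head())
```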
Train, evaluate, and compare three different classifiers (KNeighborsClassifier, RandomForestClassifier for bagging, and GradientBoostingClassifier for boosting) using both accuracy and the F-beta score.
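A sketch of this comparison loop, using synthetic data in place of the real census dataset (the split parameters and beta value are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the census data
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = [
    KNeighborsClassifier(),
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
]

for clf in classifiers:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc = accuracy_score(y_test, pred)
    # beta=0.5 weights precision more heavily than recall
    fb = fbeta_score(y_test, pred, beta=0.5)
    print(f"{type(clf).__name__}: accuracy={acc:.3f}, F0.5={fb:.3f}")
```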
Use grid search with cross-validation to tune hyperparameters for model optimization.
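A minimal grid-search sketch; the parameter grid and scorer below are illustrative assumptions, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical parameter grid; the project's actual grid may differ
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
scorer = make_scorer(fbeta_score, beta=0.5)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring=scorer,
    cv=5,  # 5-fold cross-validation
)
grid.fit(X, y)
print("Best parameters:", grid.best_params_)
```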
Use principal component analysis (PCA) to reduce the dimensionality of the data.
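A small PCA sketch, using the Iris dataset as a stand-in (the target number of components is an assumption):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 4-dimensional stand-in dataset

# Project the data onto its two leading principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

`explained_variance_ratio_` reports how much of the total variance each retained component captures, which guides the choice of `n_components`.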
Unsupervised Learning: The goal of this project is Creating Customer Segments.
Feature Exploration
Use box plots and histograms to examine the distributions of individual variables
Leverage a scatter-plot matrix and a heatmap to study correlations between variables
Apply parallel coordinates plots to investigate relationships among multiple variables
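The exploration plots above can be sketched as follows; the wine dataset and column labels stand in for the project's wholesale-customers data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_wine

# Stand-in numeric dataset; labels are illustrative, not the real schema
df = pd.DataFrame(load_wine().data[:, :4],
                  columns=["alcohol", "malic_acid", "ash", "alcalinity"])

# Distributions of individual variables: box plot and histogram
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.boxplot(ax=axes[0])
df["alcohol"].hist(ax=axes[1], bins=20)

# Pairwise relationships: scatter-plot matrix and correlation heatmap
pd.plotting.scatter_matrix(df, figsize=(8, 8))
plt.figure()
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.savefig("exploration.png")
```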
Data Preprocessing
Perform feature scaling (using the natural logarithm) to reduce the skewness of highly skewed data
Apply Tukey's method to identify outliers for removal
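The two preprocessing steps above can be sketched as follows, on hypothetical spending data standing in for the wholesale dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed spending column (one extreme value)
df = pd.DataFrame({"Fresh": [3, 12, 7, 9500, 15, 6, 8, 11, 5, 10]})

# Feature scaling with the natural logarithm to reduce skewness
log_df = np.log(df)

# Tukey's method: flag points beyond 1.5 * IQR outside the quartiles
q1 = log_df["Fresh"].quantile(0.25)
q3 = log_df["Fresh"].quantile(0.75)
step = 1.5 * (q3 - q1)
outliers = log_df[(log_df["Fresh"] < q1 - step) | (log_df["Fresh"] > q3 + step)]
print(outliers)
```

Rows flagged this way are candidates for removal before clustering.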
Compare K-means clustering and the Gaussian mixture model (GMM) for data clustering.
Apply GMM to cluster the data, leveraging the silhouette coefficient as well as the Bayesian information criterion (BIC) to choose the number of clusters.
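The comparison and model-selection steps above can be sketched as follows, on synthetic blob data in place of the log-scaled customer data; the candidate range of cluster counts is an assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the log-scaled customer data
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Score candidate cluster counts with silhouette (both models) and BIC (GMM)
for n in range(2, 6):
    km_labels = KMeans(n_clusters=n, n_init=10, random_state=0).fit_predict(X)
    gmm = GaussianMixture(n_components=n, random_state=0).fit(X)
    gmm_labels = gmm.predict(X)
    print(f"n={n}: KMeans sil={silhouette_score(X, km_labels):.3f}, "
          f"GMM sil={silhouette_score(X, gmm_labels):.3f}, "
          f"BIC={gmm.bic(X):.1f}")  # lower BIC is better
```

A higher silhouette coefficient and a lower BIC both favor a given number of clusters; agreement between the two criteria gives more confidence in the choice.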