
Reproducible machine learning notes in Python and R to record my learning journey in Data Science

Primary language: Jupyter Notebook. License: MIT.



I will continuously update this repo with reproducible machine learning notes in R and Python to record my learning journey in data science.


A list of end-to-end machine learning projects. The scope includes data preprocessing, data visualization, model building, parameter tuning, and result interpretation.

  • Titanic: Machine Learning from Disaster: predict what sorts of people were likely to survive the tragedy. [folder]
  • Music Recommender: build an end-to-end music recommender application from scratch. [folder]
  • Airbnb New User Bookings: predict in which country a new Airbnb user will make his or her first booking. [folder]
  • Forecasting Energy Consumption: predict energy consumption for 200+ buildings using time series data. [folder]


Design of experiments

  • 2018-01-06 Steps to conduct A/B Testing and Caveats [python nbviewer]
    • Hypothesis Testing | Type I error, Type II error, Power | Determining Sample Size
  • 2018-10-20 Inferring Causal Effects from Observational Data [R nbviewer]
    • Propensity Score Matching | MatchIt(library) | CausalImpact(library)
  • 2019-01-30 Solving Multi-Armed Bandit Problem through Epsilon-Greedy Algorithm [python nbviewer]
    • Multi-Armed Bandit | Epsilon Greedy Algorithm | Explore & Exploit
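
As a rough illustration of the explore and exploit trade-off named in the multi-armed bandit entry above, here is a minimal epsilon-greedy sketch. It is not taken from the notebook: the function name `epsilon_greedy`, the Bernoulli reward probabilities, and the default `epsilon=0.1` are all assumptions chosen for the example.

```python
import random


def epsilon_greedy(true_probs, n_rounds=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli arms (hypothetical helper).

    Returns (counts, values): pulls per arm and the running mean reward per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest estimated reward so far.
            arm = max(range(n_arms), key=lambda a: values[a])
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


# Assumed toy setup: three arms with success rates 0.2, 0.5, and 0.8.
counts, values = epsilon_greedy([0.2, 0.5, 0.8])
best_arm = max(range(len(values)), key=lambda a: values[a])
print(counts, [round(v, 3) for v in values], best_arm)
```

A larger `epsilon` spends more rounds exploring and fewer on the arm that currently looks best; the value used here is only a common starting point.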

Deep Learning

  • 2018-04-14 Use Transfer Learning to identify whether images are upright or sideways [python nbviewer]
    • Transfer Learning | keras | data augmentation
  • 2018-04-14 Recognizing hand-written digits using neural network [python nbviewer]
    • Neural Network | MNIST dataset
  • 2018-05-15 Convolutional Neural Network using Keras [python nbviewer]
    • Filter | Padding | Stride | Pooling | CIFAR-10 dataset | VGG16
  • 2019-02-11 Study Notes on Word Embedding and Word2Vec [python nbviewer]
    • word embedding | word2vec | skip gram | CBOW | text classification

Text Analytics

  • 2018-04-08 Text Classification using Naive Bayes [python nbviewer]
    • Bernoulli Naive Bayes | Multinomial Naive Bayes | Laplace Smoothing
  • 2018-12-29 Sentiment Analysis for Movie Reviews [python nbviewer]
    • NLP Process | N-gram | TF-IDF | Text Preprocessing | POS Tagging
  • 2019-01-29 Topic Modeling through Latent Dirichlet Allocation [python nbviewer]
    • Latent Dirichlet Allocation | Topic Modeling | gensim

KNN Based Modeling

  • 2018-03-19 KNN-Based Modeling [R nbviewer]
    • K-Nearest Neighbors | Local polynomial regression | kernel weighting function

Customer Lifetime Value

  • 2017-10-23 Customer Value calculation using RFM [python nbviewer]
  • 2018-02-27 Calculating Customer Lifetime Value [R nbviewer]
    • Simple retention model | General retention model | Survival Analysis | Markov Chain, Migration Model
  • 2018-04-17 Calculating Customer Lifetime Value using Markov Chain [python nbviewer]
    • Markov Chain | Customer Lifetime Value
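
To make the Markov chain approach to customer lifetime value more concrete, the sketch below sums discounted expected rewards over a finite horizon, i.e. the sum over t of P^t r / (1 + d)^t. It is an illustrative sketch rather than the notebook's code: `markov_clv`, the two-state transition matrix, the per-period reward, and the discount rate are all assumed values.

```python
def markov_clv(transition, reward, discount_rate, horizon):
    """Expected discounted value per starting state (hypothetical helper).

    transition[i][j]: probability of moving from state i to state j in one period.
    reward[j]: expected net contribution of a customer in state j for one period.
    """
    n = len(reward)
    # expected[i] is the expected reward t periods ahead when starting in
    # state i, which is the i-th entry of P^t r. At t = 0 it is r itself.
    expected = list(reward)
    clv = [0.0] * n
    for t in range(horizon + 1):
        factor = 1.0 / (1.0 + discount_rate) ** t
        for i in range(n):
            clv[i] += factor * expected[i]
        # Advance one period: expected <- P @ expected.
        expected = [
            sum(transition[i][j] * expected[j] for j in range(n))
            for i in range(n)
        ]
    return clv


# Assumed toy model: state 0 = active, state 1 = churned (absorbing).
P = [[0.8, 0.2],
     [0.0, 1.0]]
r = [100.0, 0.0]
clv = markov_clv(P, r, discount_rate=0.1, horizon=10)
print([round(v, 2) for v in clv])
```

With an absorbing churn state this reduces to a geometric series in retention over (1 + discount rate), which gives a quick sanity check on the result.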

Dimension Reduction

  • 2017-12-20 Principal Component Analysis [python jupyter]
    • PCA | eigenvalue & eigenvector

Optimization Method

  • 2017-12-13 Gradient Descent [R nbviewer]
    • Batch Gradient Descent | Stochastic Gradient Descent
  • 2019-01-25 Optimization and Heuristics [python nbviewer]
    • Linear Programming | Piecewise Linear Programming | Shadow Price
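
The gradient descent entry above contrasts batch and stochastic updates. A minimal sketch of the difference, assuming a one-feature linear model fit by mean squared error, might look like the following. The function names, learning rates, and the noise-free toy data are assumptions for illustration (the note itself is written in R, while Python is used here).

```python
import random


def batch_gd(xs, ys, lr=0.05, n_iters=2000):
    """Fit y ~ w*x + b using the gradient averaged over the full dataset."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        # Gradient of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def stochastic_gd(xs, ys, lr=0.01, n_epochs=2000, seed=0):
    """Fit y ~ w*x + b updating on one shuffled sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(n_epochs):
        rng.shuffle(data)
        for x, y in data:
            err = w * x + b - y
            # Gradient of the single-sample squared error (w*x + b - y)^2.
            w -= lr * 2 * err * x
            b -= lr * 2 * err
    return w, b


# Assumed toy data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]
w_batch, b_batch = batch_gd(xs, ys)
w_sgd, b_sgd = stochastic_gd(xs, ys)
print(round(w_batch, 4), round(b_batch, 4), round(w_sgd, 4), round(b_sgd, 4))
```

On noisy data the stochastic variant would typically hover around the optimum unless the learning rate is decayed; the noise-free data here keeps the comparison simple.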

Model Selection Method & Explainability

  • 2017-12-15 Model Selection Method [python nbviewer]
    • Cross Validation | Out of Bag Estimate | Grid Search
  • 2019-02-13 Machine Learning Explainability [python nbviewer]
    • Permutation Importance | Partial Dependence Plot | SHAP value

Tree based models

  • 2017-12-11 Decision Tree Introduction [python nbviewer]
    • Information Gain | Impurity measure | Entropy | Gini Index | Tree Pruning concept
  • 2017-12-11 Bagging and Random Forest [python nbviewer]
    • Ensemble method | Feature importance | Bagging | Random Forest
  • 2017-12-12 Gradient Boosting Machine for Regression [python nbviewer]
    • Boosting | Gradient Descent | GBRT | Pseudo Residual | MLE
  • 2017-12-13 Gradient Boosting Machine for Classification [python nbviewer]
    • Boosting | Cross Entropy | Softmax Function
  • 2017-09-11 xgboost parameter tuning [python jupyter]
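
The decision tree entry above lists entropy, the Gini index, and information gain as impurity concepts. As a small sketch of how these quantities are usually computed for a candidate split, under the assumption of plain class-label lists (the helper names are hypothetical and not taken from the notebooks):

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Assumed toy node with two classes in equal proportion.
parent = [0, 0, 1, 1]
perfect_split = [[0, 0], [1, 1]]      # each child is pure
useless_split = [[0, 1], [0, 1]]      # children look like the parent

print(entropy(parent), gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A tree learner would evaluate a gain like this for each candidate split and keep the one with the largest value.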

Recommender system

  • 2017-09-19 Understand Collaborative Filtering From Scratch [python nbviewer]
    • User-User CF | Item-Item CF
  • 2017-11-24 Build Up My Own Recommended Song Playlist from Scratch [python nbviewer]
    • Latent Factor Model | Alternating Least Squares | Collaborative Filtering


  • 2017-11-01 Linear Regression Model Building Guideline [R nbviewer]
    • Linear Regression | Lasso and Ridge | Model Diagnostics | Model Selection Criterion
  • 2017-11-09 Logistic Regression for binary, nominal, and ordinal response [R nbviewer]
    • Logistic Regression | Maximum probability classifier | Bayes Classifier | ROC, AUC


  • 2017-11-15 Gaussian Mixture Model [python nbviewer]
    • clustering | outlier detection | EM steps | density estimation

Discriminant Analysis

  • 2017-11-18 Discriminant Analysis [R nbviewer]
    • LDA | QDA | Bayes Classifier


  • 2017-12-09 SQL command note [Rmd]
  • 2018-02-19 pandas command note [nbviewer]
  • 2018-04-08 HDFS command note [nbviewer]
  • 2018-05-20 spark command note - RDD [nbviewer]
  • 2018-05-20 spark command note - DataFrame [nbviewer]
  • 2018-06-01 linux command note [nbviewer]
  • 2018-06-01 python plot note [nbviewer]
  • 2018-06-01 python command note [nbviewer]
  • 2018-06-09 hive command note [nbviewer]
  • 2018-06-10 neo4j - Cypher command note [nbviewer]
  • 2018-06-10 hbase command note [nbviewer]
  • 2018-06-14 pig command note [nbviewer]
  • 2018-10-15 regular expression note [nbviewer]

Good Reads

Pending List

  • 2017-02-23 Linear Regression non-traditional model building
  • 2017-02-19 Random Forest for classification problems
  • 2017-03-01 extreme gradient boosting for classification problems
  • 2017-03-15 gradient boosting tree for classification problems
  • 2017-04-08 using extreme gradient boosting to solve predicting-red-hat-business-value problem from kaggle