Comparing Machine Learning Techniques Using Pipelines - Lab

Introduction

In this lab, you'll use a dataset created by the Otto Group, which was also used in a Kaggle competition.

The description of the data set is as follows:

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). They are selling millions of products worldwide every day, with several thousand products being added to our product line.

A consistent analysis of the performance of our products is crucial. However, due to their global infrastructure, many identical products get classified differently. Therefore, the quality of our product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights Otto Group can generate about their product range.

In this lab, you'll use a data set containing:

  • A column id, which is an anonymous id unique to a product
  • 93 columns feat_1, feat_2, ..., feat_93, which are the various features of a product
  • A column target, which is the class of a product

Objectives

You will be able to:

  • Compare different classification techniques
  • Construct pipelines in scikit-learn
  • Use pipelines in combination with GridSearchCV

The Data Science Workflow

You will be following the data science workflow:

  1. Initial data inspection, exploratory data analysis, and cleaning
  2. Feature engineering and selection
  3. Create a baseline model
  4. Create a machine learning pipeline and compare results with the baseline model
  5. Interpret the model and draw conclusions

Initial data inspection, exploratory data analysis, and cleaning

The data is stored in "otto_group.csv".

Things to do here:

  • Check for NAs
  • Check the distributions
  • Check how many inputs there are
  • ...
# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
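The checks listed above could be sketched roughly as follows. This is only an illustration: the small synthetic DataFrame is a stand-in for otto_group.csv, and its sizes and class labels are assumptions, not properties of the real file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for otto_group.csv (an assumption for illustration):
# an id column, a few zero-inflated count features, and a class label.
rng = np.random.default_rng(0)
n_rows, n_feats = 200, 5
feats = rng.poisson(0.3, size=(n_rows, n_feats))
df = pd.DataFrame(feats, columns=[f"feat_{i + 1}" for i in range(n_feats)])
df.insert(0, "id", np.arange(1, n_rows + 1))
df["target"] = rng.choice(["Class_1", "Class_2", "Class_3"], size=n_rows)
# With the real file you would instead load: df = pd.read_csv("otto_group.csv")

n_missing = df.isna().sum().sum()            # total number of missing values
n_rows_cols = df.shape                       # (rows, columns)
class_counts = df["target"].value_counts()   # how balanced the classes are
summary = df.drop(columns=["id", "target"]).describe().T  # per-feature statistics

print(n_missing, n_rows_cols)
print(class_counts)
print(summary[["mean", "50%", "max"]])
```

On the real data, df.hist() is one way to look at the distributions once the summary statistics look sensible.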

If you look at the histograms, you can tell that many of the features are zero-inflated: most variables consist mostly of zeros, with some higher values here and there. The features are not normally distributed, but for most machine learning techniques this is not an issue.
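One way to put a number on zero-inflation, besides reading histograms, is to compute the share of zeros and the skewness per feature. The sketch below uses synthetic Poisson counts as a stand-in for the real features, so the values it prints say nothing about the Otto data.

```python
import numpy as np
import pandas as pd

# Synthetic zero-heavy counts standing in for feat_1 ... feat_93 (assumption)
rng = np.random.default_rng(1)
X = pd.DataFrame(
    rng.poisson(0.3, size=(500, 4)),
    columns=[f"feat_{i + 1}" for i in range(4)],
)

zero_share = (X == 0).mean()   # fraction of rows that are exactly zero, per feature
skewness = X.skew()            # positive skew: long tail of larger values

print(pd.DataFrame({"zero_share": zero_share, "skew": skewness}))
```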

# Your code here

Because the data is zero-inflated, the boxplots look as shown above. With this many zeros, most values above zero appear to be outliers. The safe decision for this data is to not delete any outliers and see what happens. Because the data is sparse, high values may be very informative. Moreover, without any intuitive meaning for each of the features, we don't know if a value of ~260 is actually an outlier.
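The effect described here can be reproduced on toy data. In this sketch (synthetic counts, an assumption for illustration) more than 75% of each feature is zero, so the interquartile range collapses to zero and the usual boxplot rule flags every nonzero value.

```python
import numpy as np
import pandas as pd

# Synthetic features in which roughly 80% of the values are zero (assumption)
rng = np.random.default_rng(5)
X = pd.DataFrame(
    rng.poisson(0.2, size=(1000, 3)),
    columns=["feat_1", "feat_2", "feat_3"],
)

q1, q3 = X.quantile(0.25), X.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)        # standard boxplot whisker rule

flagged_share = (X > upper_fence).mean()  # share of values drawn as outliers
nonzero_share = (X > 0).mean()            # share of values above zero

print(pd.DataFrame({"flagged": flagged_share, "nonzero": nonzero_share}))
```

Both columns coincide here, which is why dropping "outliers" from data like this would amount to dropping most of the nonzero signal.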

# Your code here

Feature engineering and selection with PCA

Have a look at the correlation structure of your features using a heatmap.

# Your code here
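A possible starting point for the correlation structure is sketched below on synthetic features, where feat_2 is deliberately built to correlate with feat_1 (an assumption made purely for illustration). It computes the correlation matrix and lists the most correlated pairs; the matrix itself can then be handed to a plotting function such as seaborn.heatmap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.poisson(0.5, size=300)
X = pd.DataFrame({
    "feat_1": base,
    "feat_2": base + rng.poisson(0.1, size=300),  # built to correlate with feat_1
    "feat_3": rng.poisson(0.5, size=300),
    "feat_4": rng.poisson(0.5, size=300),
})

corr = X.corr()  # Pearson correlation between every pair of features

# Keep only the upper triangle so that each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs.head())
```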

Use PCA to reduce the dimensionality of your features. Select the number of principal components so that you still keep 80% of the explained variance.

# Your code here
# Your code here
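Two common ways to find that number of components are sketched below on synthetic data (the data and its dimensions are assumptions). The first inspects the cumulative explained variance by hand; the second passes a fraction as n_components and lets scikit-learn pick the count.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the feature matrix (assumption)
rng = np.random.default_rng(3)
X = rng.poisson(0.5, size=(400, 20)).astype(float)

# Option 1: fit all components and read off the cumulative explained variance
pca_full = PCA(svd_solver="full").fit(X)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_needed = int(np.argmax(cum_var >= 0.80)) + 1   # first count reaching 80%

# Option 2: a float in (0, 1) asks PCA to keep enough components itself
pca_80 = PCA(n_components=0.80, svd_solver="full")
X_reduced = pca_80.fit_transform(X)

print(n_needed, pca_80.n_components_, X_reduced.shape)
```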

Create a train-test split with a test size of 40%

This is a relatively big test set. Feel free to make it smaller (down to ~20%), but for an initial run a smaller training set keeps the computation time more manageable.

For now, simply use the original data and not the principal components. We looked at the principal components first to get a sense of our correlation structure, and to see how we can reduce the size of our data without losing too much information. In what's next, you'll make PCA part of the pipeline!

# Your code here
# Your code here
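A minimal sketch of the split, using synthetic stand-in data. The random_state value and the stratify argument are choices made here for reproducibility and balanced classes; the lab text does not prescribe them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the target column (assumption)
rng = np.random.default_rng(4)
X = rng.poisson(0.5, size=(500, 10))
y = rng.choice(["Class_1", "Class_2", "Class_3"], size=500)

# test_size=0.4 holds out 40% of the rows; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```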

Create a baseline model

Create your baseline model in a pipeline setting. In the pipeline:

  • Your first step will be to reduce your features to the number of principal components that retains 80% of the explained variance (which we saw before)
  • Your second step will be building a basic logistic regression model.

Make sure to fit the model using the training set, and test the result by obtaining the accuracy using the test set.

# Your code here
# Your code here
# Your code here
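One possible shape for such a baseline is sketched here. The data comes from make_classification as a stand-in for the Otto features, and the step names "pca" and "clf" as well as max_iter=1000 are arbitrary choices made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic multi-class data standing in for the Otto features (assumption)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

baseline = Pipeline([
    ("pca", PCA(n_components=0.80, svd_solver="full")),   # keep 80% of the variance
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

baseline.fit(X_train, y_train)                 # PCA is fitted on the training data only
baseline_acc = baseline.score(X_test, y_test)  # accuracy on the held-out test set

print(f"Baseline test accuracy: {baseline_acc:.3f}")
```

Because PCA sits inside the pipeline, its components are learned from the training rows and merely applied to the test rows, which avoids leaking test information into the projection.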

Create pipelines for a linear SVM, a simple decision tree, and a simple random forest classifier

Repeat the above, but now create three different pipelines:

  • One for a standard linear SVM
  • One for a default decision tree
  • One for a RandomForestClassifier
# Your code here
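The three pipelines can share the same PCA step and differ only in the final estimator. The sketch below runs on synthetic stand-in data; the dictionary keys and step names are hypothetical labels, and SVC(kernel="linear") is one of several ways to get a linear SVM in scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the Otto features (assumption)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

classifiers = {
    "svm": SVC(kernel="linear", random_state=42),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, clf in classifiers.items():
    pipe = Pipeline([
        ("pca", PCA(n_components=0.80, svd_solver="full")),
        ("clf", clf),
    ])
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)   # test accuracy per model

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```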

Pipeline with grid search

Construct 3 pipelines with grid search:

  • One for support vector machines - make sure your grid isn't too big. You'll see it takes quite a while to fit SVMs with non-linear kernel functions!
  • One for random forests - try to have around 40 different models
  • One for the AdaBoost algorithm.

SVM pipeline with grid search

# Your code here
# Your code here
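A deliberately small SVM grid might look like the sketch below. The data is synthetic, and the particular values for C and gamma are illustrative guesses rather than recommended settings. Parameters of a pipeline step are addressed as step name, double underscore, parameter name.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic multi-class data standing in for the Otto features (assumption)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

pipe_svm = Pipeline([
    ("pca", PCA(n_components=0.80, svd_solver="full")),
    ("clf", SVC(random_state=42)),
])

# A list of dicts keeps kernel-specific parameters apart: 3 + 4 = 7 candidates
param_grid_svm = [
    {"clf__kernel": ["linear"], "clf__C": [0.1, 1, 10]},
    {"clf__kernel": ["rbf"], "clf__C": [1, 10], "clf__gamma": [0.001, 0.01]},
]

gs_svm = GridSearchCV(pipe_svm, param_grid_svm, scoring="accuracy", cv=3)
gs_svm.fit(X_train, y_train)

print("Best params:", gs_svm.best_params_)
print(f"Best CV accuracy: {gs_svm.best_score_:.3f}")
print(f"Test accuracy: {gs_svm.score(X_test, y_test):.3f}")
```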

Use your grid search object along with .cv_results_ to get the full result overview

# Your code here
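Since cv_results_ is a dictionary of arrays, wrapping it in a DataFrame tends to make it easier to read. The sketch below fits a tiny grid on synthetic data only so that there is something to inspect; the grid and the selected columns are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic data and a tiny grid, purely to produce a cv_results_ dict (assumption)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)
pipe = Pipeline([
    ("pca", PCA(n_components=0.80, svd_solver="full")),
    ("clf", SVC(kernel="linear", random_state=42)),
])
gs = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=3, return_train_score=True)
gs.fit(X, y)

# One row per parameter combination
results = pd.DataFrame(gs.cv_results_)
cols = ["params", "mean_train_score", "mean_test_score",
        "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

The mean_train_score column only exists when return_train_score=True is passed; comparing it with mean_test_score gives a first hint of overfitting.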

Random Forest pipeline with grid search

# Your code here
# Your code here
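To land on roughly 40 models, it can help to count the combinations with ParameterGrid before fitting anything. The grid values below are illustrative assumptions, kept small so that the sketch runs quickly on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline

# Synthetic multi-class data standing in for the Otto features (assumption)
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

pipe_rf = Pipeline([
    ("pca", PCA(n_components=0.80, svd_solver="full")),
    ("clf", RandomForestClassifier(random_state=42)),
])

param_grid_rf = {
    "clf__n_estimators": [10, 20],
    "clf__criterion": ["gini", "entropy"],
    "clf__max_depth": [2, 4, 6, 8, None],
    "clf__min_samples_split": [2, 10],
}

# 2 * 2 * 5 * 2 combinations
n_models = len(ParameterGrid(param_grid_rf))
print("Number of candidate models:", n_models)

gs_rf = GridSearchCV(pipe_rf, param_grid_rf, scoring="accuracy", cv=3)
gs_rf.fit(X, y)

print("Best params:", gs_rf.best_params_)
print(f"Best CV accuracy: {gs_rf.best_score_:.3f}")
```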

AdaBoost pipeline with grid search

# Your code here
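An AdaBoost version follows the same pattern. In this sketch the data is synthetic and only n_estimators and learning_rate are searched; the values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic multi-class data standing in for the Otto features (assumption)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

pipe_ab = Pipeline([
    ("pca", PCA(n_components=0.80, svd_solver="full")),
    ("clf", AdaBoostClassifier(random_state=42)),
])

# 2 * 3 = 6 candidate models
param_grid_ab = {
    "clf__n_estimators": [25, 50],
    "clf__learning_rate": [1.0, 0.5, 0.1],
}

gs_ab = GridSearchCV(pipe_ab, param_grid_ab, scoring="accuracy", cv=3)
gs_ab.fit(X_train, y_train)

print("Best params:", gs_ab.best_params_)
print(f"Best CV accuracy: {gs_ab.best_score_:.3f}")
print(f"Test accuracy: {gs_ab.score(X_test, y_test):.3f}")
```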

Note

Note that this solution is only one of many options. The results of the Random Forest and AdaBoost models show that there is a lot of room for improvement by tuning the hyperparameters further, so make sure to explore this yourself!

Summary

Great! You've now gotten a lot of practice. Which algorithm would you choose, and why?