A tutorial on using scikit-learn and pandas to compare machine learning classifiers on a Kaggle problem.

TutorialMLPython

This is a practical introduction to two Python data analysis and machine learning libraries, pandas and scikit-learn, through a Kaggle competition problem. The tutorial is an adaptation of the PyCon UK Introductory Tutorial given by Ezzeri Esa. The original version can be found here: https://github.com/savarin/pyconuk-introtutorial

Compared to the original PyCon introductory tutorial, this tutorial adds more sophisticated analyses for:

  • data exploration and visualisation
  • data preprocessing, including feature selection
  • cross-validation and hyperparameter tuning for various model types through the use of pipelines
  • model comparison with statistical significance tests on accuracy and area under the ROC curve, both estimated from cross-validation
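As a rough illustration of the last two items, the sketch below tunes a pipeline with cross-validated grid search and then compares two classifiers with a paired t-test on their per-fold accuracies. It is not taken from the tutorial notebooks: the synthetic data, the choice of logistic regression and random forest, the parameter grid, and the paired t-test are all assumptions made for the example.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the tutorial itself uses the Titanic data).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# Fixed folds so that both models are scored on identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Pipeline: scaling, univariate feature selection, then the classifier.
logreg = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search over pipeline steps; parameters are addressed as <step>__<param>.
logreg_search = GridSearchCV(
    logreg,
    param_grid={"select__k": [3, 5, 10], "clf__C": [0.1, 1.0, 10.0]},
    cv=cv,
    scoring="accuracy",
)
logreg_search.fit(X, y)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Per-fold accuracies for each model on the same folds.
logreg_scores = cross_val_score(logreg_search.best_estimator_, X, y, cv=cv, scoring="accuracy")
forest_scores = cross_val_score(forest, X, y, cv=cv, scoring="accuracy")

# Paired t-test on the fold-wise differences.
t_stat, p_value = ttest_rel(logreg_scores, forest_scores)
print("best params:", logreg_search.best_params_)
print("mean accuracy:", logreg_scores.mean(), forest_scores.mean(), "p-value:", p_value)
```

Passing scoring="roc_auc" instead of "accuracy" gives the same comparison on the area under the ROC curve. Note that scoring the tuned pipeline on the folds used for tuning is optimistic, and fold scores are not fully independent, so the p-value here should be read as indicative only.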

Installation Notes

This tutorial requires pandas and scikit-learn, and is best run with the IPython Notebook. If you're not sure how to install these packages, we recommend the free Anaconda distribution.

The materials are best reviewed with the IPython Notebook. You should be able to type

ipython notebook

in your terminal window and see the notebook panel load in your web browser.

Downloading the Tutorial Materials

You can clone the material in this tutorial using git as follows:

git clone https://github.com/pipalu/TutorialMLPython.git

Alternatively, there is a link above to download the contents of this repository as a zip file.

Static Viewing

The notebooks can be viewed in a static fashion using the nbviewer site, as per the links in the section below. However, we recommend reviewing them interactively with the IPython Notebook.

Presentation Format

The tutorial starts with data manipulation using pandas: loading and cleaning data. We then explore the data with some visualisation, and use scikit-learn to make predictions. By the end of the tutorial, we will have worked on the Kaggle Titanic competition from start to finish, through a number of iterations of increasing sophistication.
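To give a flavour of that workflow, here is a minimal sketch of a first iteration: load data with pandas, clean it, and fit a scikit-learn classifier. The column names follow the Kaggle Titanic train.csv, but the rows are made up for illustration, and the cleaning choices (median age, 0/1 encoding of sex) and the logistic regression model are assumptions rather than the notebooks' exact steps.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up rows using a few column names from the Kaggle Titanic train.csv;
# in practice you would call pd.read_csv on the downloaded file instead.
csv_text = """Survived,Pclass,Sex,Age,Fare
0,3,male,22,7.25
1,1,female,38,71.28
1,3,female,,7.92
0,3,male,35,8.05
1,2,female,14,30.07
0,1,male,,51.86
"""
df = pd.read_csv(io.StringIO(csv_text))

# Cleaning: fill missing ages with the median and encode sex as 0/1.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Features and target.
X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]

# Fit a simple classifier and predict on the same rows (for illustration only).
model = LogisticRegression(max_iter=1000).fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Predicting on the training rows only shows the mechanics; a Kaggle submission would instead predict on the separate test file.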

A Kaggle account is required to make submissions and review our performance on the leaderboard.

Credits

Most of the credit goes to the original instructor of the PyCon UK Introductory Tutorial, Ezzeri Esa (savarin, https://github.com/savarin), for providing the excellent tutorial materials through GitHub.