/pydata2016

A couple of projects using scikit-learn that illustrate project decision making.


Practical Machine Learning

We offer a foundation in building intelligent business applications using machine learning, walking you through all the steps to prototyping and production—data cleaning, feature engineering, model building and evaluation, and deployment—and diving into an application for anomaly detection and a personalized recommendation engine. All concepts will be presented with example code in Python.

Installation

All of the code we present uses Python 2.7. A number of libraries beyond the standard library are used. We recommend using the conda package manager to install the same versions that we are using, in a manner that won't interfere with your system packages.

Conda

You may install either the full Anaconda distribution or the smaller Miniconda system. The former provides over 720 packages, ready to use; the latter makes it easy to download them when needed. Once one of these is installed, you can install the packages we will be using into a separate environment with

$ conda env create -f environment.yml

This will create a new conda environment named pydata. It can be activated on Linux and OS X with

$ source activate pydata

or on Windows with

> activate pydata
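Once the environment is active, a quick import check can confirm that the libraries are visible to Python. The following is a minimal sketch, not part of the workshop material: the helper name `check_packages` is hypothetical, and the package list (`numpy`, `pandas`, `sklearn`) is an assumption about what `environment.yml` installs, so adjust it to match that file.

```python
import importlib
import sys


def check_packages(names):
    """Return a dict mapping each module name to True if it imports cleanly."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Assumed package list; replace with the contents of environment.yml.
    assumed_packages = ["numpy", "pandas", "sklearn"]
    print("Python %s" % sys.version.split()[0])
    for name, ok in sorted(check_packages(assumed_packages).items()):
        print("%s: %s" % (name, "ok" if ok else "MISSING"))
```

Any package reported as MISSING suggests the environment was not created or not activated in the current shell.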

Data

All of the material will use real-world data sets. We recommend that you download them to your personal machine before the day of the workshop. Two applications will be presented.

Recommendation Engine

We will be using the MovieLens 10M data set, assembled by the University of Minnesota. The data are available as a single 63 MB zip file at http://files.grouplens.org/datasets/movielens/ml-10m.zip.
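After unzipping, the ratings can be read with the standard library alone. This sketch assumes the archive contains a `ratings.dat` file whose lines have the form `UserID::MovieID::Rating::Timestamp`, as described in the data set's own README; the extracted path `ml-10M100K/ratings.dat` and the names `Rating`, `parse_rating_line`, and `read_ratings` are assumptions made for illustration, not code from the workshop.

```python
from collections import namedtuple

# One rating record; field names are chosen here for illustration.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "timestamp"])


def parse_rating_line(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a Rating."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


def read_ratings(path):
    """Yield Rating records from a ratings file, skipping blank lines."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield parse_rating_line(line)


if __name__ == "__main__":
    # Assumed location of the extracted file; adjust to where you unzipped it.
    for i, record in enumerate(read_ratings("ml-10M100K/ratings.dat")):
        print(record)
        if i >= 4:  # show only the first five records
            break
```

Reading lazily with a generator keeps memory use small, which helps when the file holds millions of ratings.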

Anomaly Detection

We will be using data from the New York CitiBike program. The data are available as a number of zip files at https://s3.amazonaws.com/tripdata/index.html. They can be downloaded with the provided script:

$ ./download.sh
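To confirm that a downloaded archive is intact, its CSV headers can be inspected without extracting it. This is a minimal sketch under two assumptions: that each zip file contains one or more `.csv` files, and that the hypothetical helper `csv_headers` and the example file name `example-tripdata.zip` stand in for whatever `download.sh` actually saves.

```python
import csv
import zipfile


def csv_headers(zip_path):
    """Map each CSV member of a zip archive to its list of column names."""
    headers = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            # Read only the first line so large files are not loaded in full.
            with archive.open(name) as handle:
                first_line = handle.readline()
            # Zip members are read as bytes; decode when needed.
            if not isinstance(first_line, str):
                first_line = first_line.decode("utf-8-sig")
            headers[name] = next(csv.reader([first_line]))
    return headers


if __name__ == "__main__":
    # Hypothetical file name; use one of the archives you downloaded.
    for member, columns in csv_headers("example-tripdata.zip").items():
        print("%s: %s" % (member, ", ".join(columns)))
```

A corrupted or truncated download typically raises `zipfile.BadZipfile` when the archive is opened, which makes this a cheap sanity check before the workshop.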