/dataweek-workshop

Machine learning workshop using Python, pandas, and scikit-learn. The first half of the day covered supervised classification with Logistic Regression and how to use cross validation to evaluate your models. The second half of the day covered unsupervised clustering with Kmeans, as well as an overview of the data science process.

Introduction to Machine Learning with Python

Welcome to Zipfian Academy's Machine Learning workshop. Thank you for attending; we hope you enjoyed the lecture (we sure had fun presenting). This exercise will give you hands-on experience with the concepts covered and will help solidify your understanding of the process of data science.

Getting Help

As always, feel free to email us about anything at all (questions, issues, concerns, feedback) at class@zipfianacademy.com. We would love to hear how you liked the class, whether the content was technical enough (or too technical), or any other topics you wish were covered.

Next Steps

We hope you have fun with this exercise! If you want to learn more or dive deeper into any of these subjects, we are always happy to discuss (and can talk for days about these subjects). And if you just can't get enough of this stuff (and want a completely immersive environment), you can apply for our intensive data science bootcamp starting January 20th.

Learning Python

This assignment assumes a basic familiarity with Python and is intended to teach you how to leverage it for data science. If you do not feel comfortable enough with Python (and programming in general), I recommend these (freely available) resources:

Setup and Environment

This exercise is written in an IPython notebook and uses many of the wonderful libraries from the scientific Python community. While you do not need IPython locally to complete the exercise (there are PDF and .ipynb versions of these instructions), I recommend setting it up on your computer if you plan to continue learning and playing with data. IPython notebooks not only provide an interface to interactively run (and debug) code in a web browser, but also let you document your work as you go along. Below are the steps to set up a scientific Python environment on your computer to complete this (and all future classes') assignments. If you have tips or suggestions to make this process easier, please reach out either on Piazza or via email.

Version control and Environment Isolation

  • Git: Distributed Version Control to keep track of changes and updates to files/data.
  • virtualenv: Python environment isolation to help manage dependencies with packages and versions.
  • pythonbrew: Manage and install multiple versions of Python. Can be handy if you want to experiment with Python 3.x.

Scientific Python packages

  • Enthought Python Distribution: A freely available packaged environment for scientific Python.
  • Scipy Superpack: Only for Mac OS X, but a one-line shell script that installs all the fundamental scientific computing packages.
  • pandas: Data analysis and statistical library providing functionality in Python similar to R.

If you are on OS X, you may need to install Xcode (with the command line utilities) or install gcc directly.

There is a tutorial walking you through the installation of these tools, with tests to make sure it all works.
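
A quick way to confirm the environment is ready is to import the core libraries and print their versions (a minimal sketch, assuming the packages above have been installed):

    # Sanity check: import the core scientific Python packages and print versions
    import numpy
    import scipy
    import pandas
    import sklearn

    for pkg in (numpy, scipy, pandas, sklearn):
        print("%s %s" % (pkg.__name__, pkg.__version__))

If any of these imports fail, revisit the installation steps above before starting the exercise.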

User knowledge modelling

In this tutorial we will be using the Grockit Question logs dataset to predict the probability of getting the next question correct. We will also cluster the data to find similar students. Once we know which students are likely to perform worse than others (classification), we can recommend similar students (found via clustering) who performed well for them to study with.
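
As a rough sketch of the classification half (not the exercise's actual solution; the file name, feature columns, and label column below are hypothetical placeholders), predicting the probability of a correct answer with scikit-learn looks roughly like this:

    # Minimal sketch: predict P(next question answered correctly) with
    # logistic regression. Column names here are hypothetical placeholders,
    # not the real Grockit log schema.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv('grockit_questions.csv')        # hypothetical file name
    features = ['time_spent', 'num_prior_correct']   # hypothetical features
    X = df[features].values
    y = df['correct'].values                         # hypothetical 0/1 label

    model = LogisticRegression()
    model.fit(X, y)

    # predict_proba returns [P(incorrect), P(correct)] for each row
    probabilities = model.predict_proba(X)[:, 1]

This only illustrates the shape of the workflow; the notebook works through the actual data.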

Resources

Outline

  1. Get the Data
  2. Preparation -- vectorization and feature engineering
  3. Train -- fit/build model from known labeled data
  4. Test -- evaluate model with cross validation
  5. Predict -- run model on data with unknown labels
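
Put together, steps 2 through 5 of the outline above might look like the following minimal sketch (the data here is synthetic; in the exercise you build X and y from the Grockit logs):

    # Minimal sketch of prepare -> train -> test -> predict with scikit-learn.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

    # Prepare: feature matrix X (n_samples x n_features) and label vector y
    X = np.random.rand(100, 2)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)

    # Train: fit a model on the labeled data
    model = LogisticRegression()
    model.fit(X, y)

    # Test: k-fold cross validation (here k=5) estimates out-of-sample accuracy
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print("mean CV accuracy: %.2f" % scores.mean())

    # Predict: score new, unlabeled observations
    new_X = np.random.rand(5, 2)
    predictions = model.predict(new_X)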

Goals

  • Understand the various stages of the ML pipeline
    • Obtain
    • Prepare
    • Train
    • Test
    • Predict
  • Get experience building models with scikit-learn
  • Decision Boundaries
  • Cost Function
  • Logistic Regression and the sigmoid function
  • Cross Validation
    • K-fold
    • Hold out
  • Optimization functions
  • Classification vs. Regression
  • Supervised vs. Unsupervised learning
  • Kmeans clustering
  • Distance functions (similarity)
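
To make the last two goals concrete, here is a minimal sketch of K-means on hypothetical per-student summary features, reusing euclidean distance (the metric K-means minimizes) as the similarity measure between students:

    # Minimal sketch: cluster students with KMeans, then rank other students
    # by euclidean distance (smaller distance = more similar).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import euclidean_distances

    # Hypothetical per-student summaries: [fraction_correct, avg_seconds_per_question]
    students = np.array([
        [0.90, 30.0],
        [0.40, 75.0],
        [0.85, 35.0],
        [0.35, 80.0],
    ])

    kmeans = KMeans(n_clusters=2)
    labels = kmeans.fit_predict(students)

    # Distance from the first student to every student in the dataset
    distances = euclidean_distances(students[0:1], students)[0]
    print(labels, distances)

In practice you would scale the features first (e.g. with sklearn.preprocessing.StandardScaler), since distance-based methods are sensitive to the units of each feature.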