/GADS11-NYC-Summer2014

Lecture Repository for GADS11

Data Science Course: Lectures and Materials

Issues: For questions, answers and discussions:
Viewing Your and Other Student Work:

iPython Notebook Viewer for this class's student repo

Git Workflow and Command Line Tips:

Class Meetings

Introduction to Data Science

Monday, 3/31/14

Class Materials

Data Collection and Extraction

Wednesday, 4/7/14

Project 1 Introduced

Class Materials

Additional Resources:

##### Learning how to use the file pager, less

Python Documentation

Handy to have this in your bookmarks!

Couple extra handy python introductions
Beautiful Soup Tutorials
APIs to play with

Numpy

Wednesday 4/9/2014

Class Materials

Additional Resources

Pandas

Monday 4/14/2014

Class Materials

Data Visualization and MatPlotLib

Wednesday 4/16/2014

Class Materials

Lecture Notes: Data Visualization

Python Notebook: Plotting with Matplotlib

Assignments Due

  • Complete and submit previous assignments

Additional Resources

Resource About
Basic Plotting in Pandas
Matplotlib userguide
Matplotlib Gallery Examples with Code
Rougier and Prace EuroSciPy Matplotlib Tutorial Short Overview

Exploratory Data Analysis

Monday 4/21/2014 We'll be reviewing a number of datasets and going through the Data Exploration Process

The ACES model for Data Exploration:

| Letter | Step | Notes |
| ----- | ----- | ----- |
| A | Acquire the data and Assemble the data frame | Find data, import into Pandas |
| C | Clean the data frame | Identify and limit columns, rows, indices, dates, etc. |
| E | Explore global properties | Visualize! Basic plots and stats appropriate to the data set |
| S | Subset comparisons | Look at (visualize!) initial emergent variable relationships and subsets |
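
As a rough illustration of the four ACES steps, here is a minimal sketch in Pandas. The inline CSV, its column names, and the variable names are all hypothetical stand-ins for a real data source; this is not taken from the class materials.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real data source.
RAW = """date,city,temp_f,notes
2014-04-01,NYC,52,ok
2014-04-02,NYC,,sensor down
2014-04-01,Boston,47,ok
2014-04-02,Boston,49,ok
"""

# A: acquire the data and assemble the data frame.
df = pd.read_csv(io.StringIO(RAW), parse_dates=["date"])

# C: clean. Keep the useful columns, drop rows with a missing reading,
# and use the date as the index.
clean = df[["date", "city", "temp_f"]].dropna(subset=["temp_f"]).set_index("date")

# E: explore global properties with summary statistics.
summary = clean["temp_f"].describe()

# S: subset comparison. Mean temperature per city.
by_city = clean.groupby("city")["temp_f"].mean()

print(summary)
print(by_city)
```

In a notebook, the E and S steps would usually end in a plot rather than a printout.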

Class Materials

Resources

Assignments Due

N/A - Please review all prior materials and work on Project 1.

Presentations, Machine Learning, and Data Science Careers

Wednesday 4/23/2014

Assignments Due

[Project 1: Scraping, APIs, and Data Visualization](https://github.com/datadave/GADS9-NYC-Spring2014-Lectures/blob/master/projects/project01.md)

Class Outline

  • Selected Presentations of Student Projects
  • Discussion of Data Science Careers
  • Introduction to Machine Learning

Linear Regressions

Monday 4/28/2014 We'll be discussing the linear regression algorithm and learning about scoring regression models.
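
To make "fitting" and "scoring" concrete, here is a minimal sketch of one-variable ordinary least squares with two common scores, R² and RMSE, written in plain Python. The function names and the toy data are assumptions for illustration only; the class materials may take a different approach.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(ys, preds):
    """Share of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def rmse(ys, preds):
    """Root mean squared error, in the same units as ys."""
    return (sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)) ** 0.5


# Hypothetical toy data: one feature and one target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [12.0, 19.0, 31.0, 42.0, 48.0]

slope, intercept = fit_line(xs, ys)
preds = [slope * x + intercept for x in xs]
print(slope, intercept, r_squared(ys, preds), rmse(ys, preds))
```

Scores computed on the training data are optimistic; scoring on held-out rows gives a fairer picture.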

Class Materials

Assignments Due

Please submit three optimized models, one for each y variable (casual, registered, and cnt), using the data/day.csv file. Submit them as an IPython notebook or Python script in a lab_submissions/lab07/yourname folder.

More Reading

Resource About
Regressions with Sklearn
Overfitting Regressions
Guide to Logistic Regression
Khan Academy Algebra Review
MIT OCW

Naive Bayes

(link to lesson folder)

Wednesday 4/30/2014 We'll be reviewing some basics of probability, developing ways to work with text data, and using a classification algorithm to classify text.

Objectives

  • Articulate Naive Bayes' advantages, flaws, applications and theoretical foundation
  • Explain how Naive Bayes is applied to classify text or Spam
  • Be familiar with using the N.B. classifiers in NLTK and SKLearn
  • Create a basic Naive Bayes classifier
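
As one possible reading of the last objective, here is a minimal sketch of a Naive Bayes text classifier in plain Python, using word counts and add-one (Laplace) smoothing. The `train` and `predict` names and the four training sentences are made up for illustration; the NLTK and SKLearn classifiers have their own interfaces.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (text, label) pairs; returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        # Log prior: how common the label is in the training data.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Log likelihood with add-one smoothing, so unseen
            # word/label pairs do not zero out the whole product.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(predict("free money", model))
print(predict("monday meeting", model))
```

The "naive" part is the assumption that words are independent given the label, which is what lets the per-word log likelihoods simply add up.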

Materials

Assignments

  • Add a feature to the NLTK gender classifier to try and improve performance
  • Create a classifier to tell the difference between two authors
  • Brainstorm classification topics for projects (due May 14)

Follow Up Notes

Based on student feedback:

Classifier Comparison and Logistic Regression

Wednesday 5/7/2014

Objectives

  • Understand how to apply logistic regression to a classification problem
  • Create a two-dimensional feature space to evaluate the performance of classifiers
  • Leverage the interoperability of SKLearn classifiers to compare KNN, Naive Bayes, Decision Trees and Logistic Regression on a single classification problem

Materials

The lesson notebook provides:

  • A brief background on logistic classification
  • A mesh function using np.meshgrid to evaluate the predictive functions on a two-dimensional feature grid

The intention is to provide a starting template with which to contrast various classifiers on a clean, real-world data set.
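
The notebook's own mesh function is not reproduced here, but a sketch of the general idea might look like the following. The `evaluate_on_grid` helper and the stand-in `predict` function are hypothetical; with a fitted SKLearn classifier, its `predict` method would be passed in instead.

```python
import numpy as np


def predict(points):
    """Hypothetical stand-in for a fitted classifier's predict method:
    labels a point 1 when x + y > 1, else 0."""
    return (points[:, 0] + points[:, 1] > 1).astype(int)


def evaluate_on_grid(predict_fn, x_range, y_range, steps=50):
    """Evaluate predict_fn at every point of a regular 2-D feature grid."""
    xs = np.linspace(x_range[0], x_range[1], steps)
    ys = np.linspace(y_range[0], y_range[1], steps)
    xx, yy = np.meshgrid(xs, ys)
    # Flatten the grid into an (n_points, 2) feature matrix,
    # predict, then restore the grid shape for plotting.
    grid = np.c_[xx.ravel(), yy.ravel()]
    zz = predict_fn(grid).reshape(xx.shape)
    return xx, yy, zz


xx, yy, zz = evaluate_on_grid(predict, (0, 2), (0, 2), steps=5)
print(zz)
```

The three returned arrays have the shape expected by Matplotlib's `contourf`, which is a common way to draw each classifier's decision regions for side-by-side comparison.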

Static Ipython Notebook

Assignments

Students are expected to add additional classifiers to the notebook, experiment with parameters, and develop conclusions about the differences between classifier performance on the given sample data.

Your Centroid or Mine? An Introduction to K-Means

Monday, May 19th

Humans are good, often too good, at clustering, and it's another realm of our intelligence that we can programmatically apply to machines. Toddlers can tell that objects are boats, flags, and doggies -- but how?

Machine clustering is used to categorize the web, understand galaxies, organize genetic data, segment customers, classify mental illness, and detect disease patterns, to name just a few applications.

K-Means is a very simple clustering algorithm that works well and is by far the most widely used.
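
To show how simple the algorithm is, here is a minimal sketch of the standard K-Means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points) in plain Python. The `kmeans` function, its parameters, and the six sample points are assumptions for illustration, not the class implementation.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point
        # (squared Euclidean distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The result can depend on the starting centroids, which is why practical implementations usually rerun the algorithm from several random starts and keep the best result.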

Here are some resources to get started:

Recommended Resources

| Title | Author | Type | Length | Difficulty | Description | Rating (1 to 4 Stars) |
| ----- | ----- | ---- | ----- | ------ | --- | --- |
| Cluster Analysis and K-Means | Kumar, UMN | PDF Excerpt | 40 pages | Intermediate | Good chapter overview on clustering and then section "8.2.1 Basic K-Means Algorithm" gives great K-Means summary. The rest of 8.2 includes complications with K-Means and concludes with the optimization math. If you just want the minimum, read pgs. 498 to 505 | ++++ |
| Clustering Overview | StanfordML | html page | 3 pages | Intermediate | Good, quick overview of everything | ++++ |
| K-Means Clustering | Mathematical Monk | Video | 15 minutes | Novice | Good Khan-style overview of math | +++ |
| K-Means Wikipedia Entry | Everyone | Wikipedia | 6 pages | Intermediate | Includes Iris and 'mickey mouse' we'll be looking at. | ++ |

Class Lecture

Review of Random Forests and the Ensemble Learning Approach

Wednesday, May 21st

We'll wrap up all machine learning material with a more detailed discussion of the differences between bagging (bootstrapping), boosting, and random forests.
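
As a small illustration of the bagging idea only, the sketch below trains several one-feature decision stumps, each on its own bootstrap sample, and combines them by majority vote. All function names and the toy data are hypothetical. A random forest follows the same pattern with full decision trees and a random subset of features considered at each split; boosting is not shown.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold on a 1-D feature that misclassifies the fewest points.
    The stump predicts 1 when x > threshold, else 0."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_stumps(data, n_models=25, seed=0):
    """Fit one stump per bootstrap sample and return the list of thresholds."""
    rng = random.Random(seed)
    # A bootstrap sample is len(data) draws from data, with replacement.
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]


def predict(thresholds, x):
    """Majority vote across all stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (feature, label) pairs.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
model = bagged_stumps(data)
print(predict(model, 0.5), predict(model, 9.5))
```

Because each stump sees a slightly different sample, the individual thresholds vary, and the vote smooths out that variation.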

Recommended Resources

| Title | Author | Type | Length |
| ----- | ----- | ---- | ----- |
| Ensemble Learning | Wikipedia | Article | --- |
| sklearn doc | Scikit-learn | Documentation | --- |
| yhat Blog on Random Forests | yHat | blog article | --- |
| Ensemble Methods in Machine Learning | Dietterich, Thomas | PDF Journal | 15 pages |
| A Few Useful Things to Know about Machine Learning | Domingos, Pedro | PDF Journal | 9 pages |
| Ensemble Methods | Hyer, Jay | Presentation | 31 Slides |
| Kaggle Random Forests | Kaggle | Kaggle | --- |

Class Lecture

Updates expected -- See Lesson Folder for further details