DAT3: A Python repository from vontasa

DAT3 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (10/2/14 - 12/18/14). View student work in the student repository.

Instructors: Josiah Davis and Kevin Markham

Course Project information

Week	Tuesday	Thursday
0		10/2: Introduction
1	10/7: Git and GitHub	10/9: Base Python
2	10/14: Getting and Cleaning Data	10/16: Exploratory Data Analysis
3	10/21: Linear Regression Milestone: Question and Data Set	10/23: Linear Regression Part 2
4	10/28: Machine Learning and KNN	10/30: Model Evaluation
5	11/4: Logistic Regression Milestone: Data Exploration and Analysis Plan	11/6: Logistic Regression Part 2, Clustering
6	11/11: Dimension Reduction	11/13: Clustering Part 2, Naive Bayes
7	11/18: Natural Language Processing	11/20: Decision Trees
8	11/25: Recommenders Milestone: First Draft Due	Thanksgiving
9	12/2: Ensembling	12/4: Ensembling Part 2, Python Companion Tools
10	12/9: Working a Data Problem Milestone: Second Draft Due	12/11: Neural Networks
11	12/16: Review	12/18: Project Presentations

Class 1: Introduction

Introduction to General Assembly
Course overview and philosophy (slides)
What is data science? (slides)
Brief demo of Slack

Homework:

Install Anaconda distribution of Python 2.7, Git, and Slack
Add a photo to your Slack profile
Create a GitHub account
Read Analyzing the Analyzers (40 pages) and think about where you'd like to fit in!

Optional:

Subscribe to some data-focused newsletters, to keep current: Center for Data Innovation, O'Reilly Data Newsletter, Data Community DC
Watch Introduction to Data Science and Analysis (50 minutes) for another look at the data science workflow
Find an open source project hosted on GitHub that interests you

Class 2: Git and GitHub

Homework discussion: Any installation issues? Find any interesting GitHub projects? Any takeaways from "Analyzing the Analyzers"?
Introduce yourself: What's your technical background? Why did you join this course? How do you define success in this course?
Office hours
Git and GitHub lesson (slides)
- Create a repo on GitHub, clone it, make changes, and push up to GitHub
- Fork the DAT3-students repo, clone it, add a Markdown file (about.md) in your folder, push up to GitHub, and create a pull request

Homework:

Review the course project information, past projects from other GA students, and public data sources

Optional:

Clone this repo (DAT3) for easy access to the course files
Watch Introduction to Git and GitHub (36 minutes) to repeat a lot of today's presentation
Read the first two chapters of Pro Git for a much deeper understanding of version control and the basic Git commands
Learn some more Markdown and add it to your about.md file, then push those edits to GitHub and send another pull request
Read this friendly command line tutorial if you are brand new to the command line
For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford

Resources:

Dillinger is a browser-based Markdown editor, useful for checking your Markdown code
GitRef is an excellent reference guide for Git commands
Git quick reference for beginners is a shorter reference guide with commands grouped by workflow

Class 3: Base Python

Any questions about Git/GitHub?
Discuss the course project. What's one thing you learned from reviewing student projects?
Base Python lesson, with exercises (code)

Homework:

Complete the exercises at the end of the Python script we went over in class today and add your solutions to your folder in the DAT3-students repo
Keep thinking about your project, and consult past projects and public data sources for more inspiration

Class 4: Getting and Cleaning Data

Discuss homework solutions (code)
File input/output in Python
- Article, original data, modified data
- Open in Sublime Text
- Reading and writing files (code)
Getting data from APIs
- What is an API? Why provide one?
- Apigee: API providers, Echo Nest API console
- Echo Nest Developer Center for API key and documentation
- Three options for reading data into Python (code):
  - curl to file, view file in browser, read with json module
  - Use requests
  - Use Pyechonest

Homework:

Exercise 2 from file input/output
Read What I do when I get a new data set as told through tweets
Watch Look at Your Data (18 minutes)

Optional:

Exercise 3 from file input/output
Read this fun article about using web scraping to analyze Netflix's "micro-genres"

Resources:

Online Python Tutor is useful for visualizing (and debugging) your code
Directory of API wrappers for Python

Class 5: Exploratory Data Analysis

Discuss homework solutions (code)
Scraping the web for data
- What is web scraping? Why use it?
- Web scraping example (code):
  - Pages to scrape using Beautiful Soup 4
  - Adapted from Web scraping 101 with Python
Pandas for data analysis (code)
- Split-Apply-Combine pattern

Homework:

Project milestone: Submit your question and data set to DAT3-students by Tuesday!
Read through this excellent example of data wrangling and exploration in Pandas

Optional:

To learn more Pandas, read through this three-part tutorial (some overlap with today's class), or read through these two excellent (but extremely long) notebooks: Introduction to Pandas, Data Wrangling with Pandas

Resources:

For more web scraping with Beautiful Soup 4, here's a longer example: slides, code
Web scraping without writing any code: "turn any website into an API" with import.io or kimono
Simple examples of joins in Pandas, for when you need to merge multiple DataFrames together

Class 6: Linear Regression

Discuss your project question and data set
Pandas for visualization (code)
Linear regression (code, slides)
- What is linear regression?
- How to interpret the output?
- What assumptions does linear regression depend upon?
- What is multicollinearity and heteroskedasticity, and why should I care?
- How do I represent categorical variables?

Optional:

Post your favorite visualization in the "viz" channel on Slack, and tell us what you like about it!

Resources:

For more on Pandas plotting, browse through this IPython notebook or read the visualization page from the official Pandas documentation
To learn how to customize your plots further, browse through this IPython notebook on matplotlib
To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and here is an excellent slide deck from Columbia's Data Mining class
If you are already a master of ggplot2 in R, you may prefer "ggplot for Python" over matplotlib: introduction, tutorial

Class 7: Linear Regression Part 2

Linear regression, continued

Homework:

Complete the exercises at the end of the python script from class

Resources:

One of the best places to go for more information about linear regression is chapter 3 of our course "textbook": An Introduction to Statistical Learning - or just read Kevin's highly abbreviated version
For more information about core assumptions, check out this article and this one
For more on log transformations, check out this article
This handout provides an overview of the computation of the F-test
This may be a helpful article on how to derive the coefficient estimates

Class 8: Machine Learning and KNN

Discuss homework solutions (code)
"Human learning" on iris data using Pandas (code)
Introduction to numpy (code)
Machine learning and K-Nearest Neighbors (slides)

Homework:

Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it on Thursday

Optional:

Walk through the rest of the numpy reference and see if you can understand each of the functions

Resources:

For a more thorough introduction to numpy, this guide is quite good

Class 9: Model Evaluation

Introduction to scikit-learn with iris data (code)
Discuss the article on the bias-variance tradeoff
Model evaluation procedures (slides, code)
- Training error
- Underfitting and overfitting
- Test set approach
- Cross-validation
Model evaluation metrics (slides, code)
- Confusion matrix
Introduction to Kaggle

Homework:

Project milestone: Submit your "Data Exploration and Analysis Plan" to DAT3-students by Tuesday!
Read this simple example of machine learning and see if you understand everything in the article
Watch Kevin's Kaggle project presentation video (16 minutes) for a tour of the machine learning process

Optional:

For more on Kaggle, watch the video Kaggle Transforms Data Science Into Competitive Sport (28 minutes)
For much more on the Kaggle Allstate competition, read Kevin's project paper, read a brief interview with the first place team, review the Python code from the second place team, or skim the solution sharing thread
If you want to try out the Kaggle Bike Sharing Demand competition, feel free to reuse Kevin's starter code

Resources:

If you'd like to see more on today's topics, these videos from Hastie and Tibshirani are excellent: bias-variance tradeoff (10 minutes), test set (aka "validation set") approach (14 minutes), cross-validation (14 minutes) - or just read section 5.1 from their book (free PDF download!)
Kevin wrote a simple guide to confusion matrix terminology that you can use as a reference guide
The Kaggle wiki has a decent page describing other common model evaluation metrics

Class 10: Logistic Regression

Any questions from last time: model evaluation, Kaggle, article on Smart Autofill?
Summary of your feedback
Discuss your data exploration and analysis plan
Logistic Regression (slides, code)

Homework:

Continue to work on Part I of the exercise from class and submit your solution to DAT3-students

Class 11: Logistic Regression Part 2, Clustering

Logistic Regression, continued (exercise solution)
Clustering (slides)
- Why cluster?
- Introduction to the K-means algorithm

Homework:

Read through section 8.2 on K-means Clustering from Introduction to Data Mining by next Thursday. What are some of the strengths and limitations of k-means clustering?

Resources:

If you would like a review on the topics we covered today (and Tuesday), the videos from Hastie and Tibshirani from Stanford are very good:
- Introduction to Classification (10 minutes)
- Logistic Regression and Maximum Likelihood (9 minutes)
- Multivariate Logistic Regression and Confounding Variables (10 minutes)
If you want to understand the math of how coefficients are estimated, check out these notes from CMU's Advanced Data Analysis class. Written by Cosma Shalizi, one of CMU's professors.
Documentation for plotting math text
Documentation for plotting scatter plots

Class 12: Dimension Reduction

Model evaluation metrics, continued
- ROC curves and AUC (visualization, code)
- Root Mean Squared Error (slides)
Dimension Reduction (Guest Lecturer: Sinan Ozdemir)
- Slides
- Code: PCA and SVD
- Code: image compression with PCA (original source)

Homework:

Read Paul Graham's "A Plan for Spam" in preparation for Thursday's class on Naive Bayes

Resources:

Kevin has a video tutorial (14 minutes) and blog post summarizing ROC curves and AUC
scikit-learn has extensive documentation on model evaluation
On Cross Validated, this question has dozens of explanations of PCA, and this question has a useful visualization of what is essentially PCA

Class 13: Clustering Part 2, Naive Bayes

Clustering Analysis (slides)
- Understanding the K-means algorithm
- Choosing K for k-means
- Exercises
- Visualizing data in multi-dimensional space
- Limitations of K-means, K-means cluster validation
Naive Bayes (slides)
- Briefly discuss "A Plan for Spam"
- Probability and Bayes' Theorem (original source, confusion matrix reference)
- Naive Bayes classification

Homework:

Open Python, type import nltk, type nltk.download(), find the "NLTK Downloader" popup window, click on "all", then click on "Download". Do this at home, since it's more than 300 MB! If you have space constraints on your computer, we can tell you next class exactly which packages to download.

Resources:

For clustering, scikit-learn has documentation on K-means clustering, alternative clustering algorithms, and clustering metrics
Vipin Kumar from the University of Minnesota has a helpful chapter on clustering from his textbook: Introduction to Data Mining
For an alternative introduction to Bayes' Theorem, Bayes' Rule for Ducks, Bayes' Rule in an animated gif, and this 5-minute video on conditional probability may be helpful
For more details on Naive Bayes classification, Wikipedia has two useful articles: Naive Bayes classifier, Naive Bayes spam filtering
If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004

Class 14: Natural Language Processing

Naive Bayes, continued (code)
- Create a spam classifier using CountVectorizer and MultinomialNB
Natural Language Processing (code)
- Real-world examples
- NLTK: tokenization, stemming, lemmatization, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, document summarization
- Alternative: TextBlob

Resources:

Natural Language Processing with Python: free online book to go in-depth with NLTK
NLP online course: no sessions are available, but video lectures and slides are still accessible
Brief slides on the major task areas of NLP
Detailed slides on a lot of NLP terminology
A visual survey of text visualization techniques: for exploration and inspiration
DC Natural Language Processing: active Meetup group
Stanford CoreNLP: suite of tools if you want to get serious about NLP
Getting started with regex: Python introductory lesson and reference guide, real-time regex tester, in-depth tutorials

Homework:

We will use Graphviz to visualize the output of the classification trees. Please download this before class.
- In order to download and install for Windows, you will also need to manually add the folder location (e.g., C:\Program Files (x86)\Graphviz2.30\bin;) to your environment path. (download graphviz-2.38.msi)
- You can also download and install for Mac. I am not aware of any issues with installation.

Class 15: Decision Trees

At the end of this class, you should be able to do the following:

Describe the output of a decision tree to someone without a data science background
Describe how the algorithm creates the decision tree
Predict the likelihood of a binary event using the decision tree algorithm in scikit-learn
Create a decision tree visualization
Determine the optimal tree size using a tune grid and the AUC metric in Python
Describe the strengths and weaknesses of a decision tree

Homework:

Work on your project. The first draft of your project is due on Tuesday at 5 pm.

Resources:

Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes a code walkthrough
For those of you with background in javascript, d3.js has a nice tree layout that would make more presentable tree diagrams
- Here is a link to a static version, as well as a link to a dynamic version with collapsable nodes
- If this is something you are interested in, Gary Sieling wrote a nice function in python to take the output of a scikit-learn tree and convert into json format
- If you are intersted in learning d3.js, this a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
Chapter 8.1 of the Introduction to Statistical Learning also covers the basics of Classification and Regression Trees

Class 16: Recommenders

Recommenders (Guest Lecturer: Sinan Ozdemir)
- Slides
- Framework code, solution

Class 17: Ensembling

Resources:

Leo Brieman's paper on Random Forests
yhat has a brief primer on Random Forests that can provide a review of many of the topics we covered today.
Here is a link to some Kaggle competitions that were won using Random Forests
Ensemble models... tend to strongly outperform their component models on new data. Doesn't this violate “Occam’s razor”? In this paper entitled: The Generalization Paradox of Ensembles John Elder IV argues for a more refined understanding of model complexity.

Class 18: Ensembling Part 2, Python Companion Tools

Ensembling
- Review the Random Forest algorithm
- Discuss tuning the Random Forest Algorithm
- Give an overview of the AdaBoost algorithm
- Implement boosted trees in Python
IPython Notebook
- nbviewer
- Biostatistics course: GitHub repo, direct links to nbviewer
- A gallery of interesting IPython Notebooks

Resources:

Chapter 10 of the Elements of Statistical Learning covers Boosting. See page 339 for the algorithm presented in class.
Dr. Justin Esary has a nice tutorial on Boosting. Watch from 32:00 – 59:00 for relevant material.
Tutorial by Professor Rob Schapire of Princeston on the AdaBoost Algorithm
IPython documentation in website form and notebook form: does not focus exclusively on the IPython Notebook

Class 19: Working a Data Problem

Kaggle: Avazu Click-Through Rate Prediction

Class 20: Neural Networks

Inspiration for Neural Networks
Neural Networks
Gradient Descent

Resources:

Michael Neilson has a free book called Neural Networks and Deep Learning that gives thorough introduction to Neural Networks
Geoffrey Hinton, one of the pioneers of the deep learning movement, has an entire class on Coursera called Neural Networks for Machine Learning
The Python wiki has a list of Python packages commonly used for Neural Networks

Homework:

Read this "classic" paper, which may help you to connect many of the topics we have studied throughout the course: A Few Useful Things to Know about Machine Learning

Class 21: Review

Review of all of data science (slides)
Comparing supervised learning algorithms
Special guest: Laura Lorenz

Resources:

scikit-learn "machine learning map": Guide for choosing the optimal estimator
Choosing a Machine Learning Classifier: Short and highly readable
Machine Learning Done Wrong: Thoughtful advice on common mistakes to avoid in machine learning
Practical machine learning tricks from the KDD 2011 best industry paper: More advanced advice than the resources above
An Empirical Comparison of Supervised Learning Algorithms: Research paper from 2006
Getting in Shape for the Sport of Data Science: 75-minute video of practical tips for machine learning (by the past president of Kaggle)
Resources for continued learning
Bonus content (below)
Keep using Slack!

Class 22: Project Presentations

Note: Guests are welcome! Invite your friends and family!

Bonus Content

Python reference guide
Regular expressions ("regex"):
- Simple example of regex usage: code, data
- Detailed reference guide
- Python introductory lesson
- In-depth tutorials
- Real-time regex tester
- Fun example: Exploring Expressions of Emotions in GitHub Commit Messages
Tidy data
- Brief summary of tidy data
- Tidy data resources from Hadley Wickham: detailed paper, shorter version of the paper with more R code, slides from his class on tidy data, video presentation
- Examples: tidy, untidy, very untidy
- Common issues with obtaining tidy data from Excel
Reproducibility
- Overview of reproducibility and reproducible research
- Twitter definition of reproducibility
- How to share data with a statistician: practical guide for turning raw data into tidy data in a reproducible and documented manner
- Example of reproducible data processing: includes raw data, processed data, processing scripts, and documentation
- Colbert on reproducibility (8-minute video)
Domino Data Lab
- Primary use cases: run your model in the cloud, create a self-service web form or API that interacts with your model
- Usage rates and subscription pricing
- Command reference

vontasa/DAT3

DAT3 Course Repository

Class 1: Introduction

Class 2: Git and GitHub

Class 3: Base Python

Class 4: Getting and Cleaning Data

Class 5: Exploratory Data Analysis

Class 6: Linear Regression

Class 7: Linear Regression Part 2

Class 8: Machine Learning and KNN

Class 9: Model Evaluation

Class 10: Logistic Regression

Class 11: Logistic Regression Part 2, Clustering

Class 12: Dimension Reduction

Class 13: Clustering Part 2, Naive Bayes

Class 14: Natural Language Processing

Class 15: Decision Trees

Class 16: Recommenders

Class 17: Ensembling

Class 18: Ensembling Part 2, Python Companion Tools

Class 19: Working a Data Problem

Class 20: Neural Networks

Class 21: Review

Class 22: Project Presentations

Bonus Content