SF_DAT_17: A Jupyter Notebook repository from ovlas26

SF DAT 17 Course Repository

Course materials for General Assembly's Data Science course in San Francisco, CA (9/14/15 - 12/02/15).

Instructors: Sinan Ozdemir (who is super cool!!!!!!)

Teaching Assistants: David, Matt, and Sri (who are all way more awesome)

Office hours: All will be held in the student center at GA, 225 Bush Street

Monday	Wednesday
9/14: Introduction / Expectations / Intro to Data Science	9/16: Git / Python
9/21: Data Science Workflow / Pandas	9/23: More Pandas!
9/28: Intro to Machine Learning / Numpy / KNN	9/30: Scikit-learn / Model Evaluation Project Milestone: Question and Data Set HW Homework 1 Due
10/5: Linear Regression	10/7: Logistic Regression
10/12: Columbus Day (NO CLASS)	10/14: Working on a Data Problem
10/19: Clustering	10/21: Natural Language Processing
10/26: Naive Bayes Milestone: First Draft Due	10/28: Decision Trees
11/2: Ensembling Techniques	11/4: Dimension Reduction Milestone: Peer Review Due
11/9 Support Vector Machines	11/11: Web Development with Flask
11/16: Recommendation Engines	11/18: Neural Networks Continued
11/23: SQL	11/25: Turkey Day (NO CLASS)
11/30: Projects	12/2: Projects

Installation and Setup

Install the Anaconda distribution of Python 2.7x.
Install Git and create a GitHub account.
Once you receive an email invitation from Slack, join our "SF_DAT_17 team" and add your photo!

Resources

PEP 8 - Style Guide for Python

Class 1: Introduction / Expectations / Intro to Data Science

Introduction to General Assembly
Course overview: our philosophy and expectations (slides)
Intro to Data Science: (slides)
Tools: check for proper setup of Git, Anaconda, overview of Slack

####Homework

Make sure you have everything installed as specified above in "Installation and Setup" by Wednesday

Class 2: Git / Python

Introduction to Git
Intro to Python: (code)

####Homework

Go through the python file and finish any exercise you weren't able to in class
Make sure you have all of the repos cloned and ready to go
- You should have both "SF___DAT___17" and "SF___DAT___17__WORK"
Read Greg Reda's Intro to Pandas

Resources:

In depth Git/Github tutorial series made by a GA_DC Data Science Instructor here
Another Intro to Pandas (Written by Wes McKinney and is adapted from his book)
- Here is a video of Wes McKinney going through his notebook!
Class 3: Pandas

Agenda

Intro to Pandas walkthrough here
- I will give you semi-cleaned data allowing us to work on step 3 of the data science workflow
- Pandas is an excellent tool for exploratory data analysis
- It allows us to easily manipulate, graph, and visualize basic statistics and elements of our data
- Pandas Lab!

Homework

Begin thinking about potential projects that you'd want to work on. Consider the problems discussed in class today (we will see more next time and next Monday as well)
- Do you want a predictive model?
- Do you want to cluster similar objects (like words or other)?

Resources:

Pandas
- Split-Apply-Combine pattern
- Simple examples of joins in Pandas
- Check out this excellent example of data wrangling and exploration in Pandas
  - For an extra challenge, try copying over the code into your own .py file
- To learn more Pandas, review this three-part tutorial
- For more on Pandas plotting, read the visualization page from the official Pandas documentation.

Class 4 - More Pandas

Agenda

Class code on Pandas here
We will work with 3 different data sets today:
- the UFO dataset (as scraped from the reporting website
- Fisher's Iris Dataset (as cleaned from a machine learning repository
- A dataset of (nearly) every FIFA goal ever scored (as scraped from the website)
Pandas Lab! here

####Homework

Please review the readme for the first homework. It is due NEXT Wednesday (9/30/2015)
The one-pager for your project is also due. Please see project guidelines

Class 5 - Intro to ML / Numpy / KNN

####Agenda

Intro to numpy code
- Numerical Python, code adapted from tutorial here
- Special attention to the idea of the np.array
Intro to Machine Learning and KNN slides
- Supervised vs Unsupervised Learning
- Regression vs. Classification
Iris pre-work code and code solutions
- Using numpy to investigate the iris dataset further
- Understanding how humans learn so that we can teach the machine!
Lab to create our own KNN model

####Homework

The one page project milestone as well as the pandas homework!
Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
- In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
- In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
- In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
- How does the choice of K affect model bias? How about variance?
- As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
- Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
- Does a high value for K cause over-fitting or under-fitting?

Resources:

For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)

Class 6: scikit-learn, Model Evaluation Procedures

Introduction to scikit-learn with iris data (code)
Exploring the scikit-learn documentation: user guide, module reference, class documentation
Discuss the article on the bias-variance tradeoff
Look as some code on the bias variace tradeoff
- To run this, I use a module called "seaborn"
- To install to anywhere in your terminal (git bash) and type in sudo pip install seaborn
Model evaluation procedures (slides, code)

Homework:

Keep working on your project. Your data exploration and analysis plan is due in three weeks!

Optional:

Practice what we learned in class today!
- If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
- If you don't yet have your project data: Pick a suitable dataset from the UCI Machine Learning Repository, try using KNN for classification, and evaluate your model. The Glass Identification Data Set is a good one to start with.
- Either way, you can submit your commented code to your SF_DAT_15_WORK, and we'll give you feedback.

Resources:

Here's a great 30-second explanation of overfitting.
For more on today's topics, these videos from Hastie and Tibshirani are useful: overfitting and train/test split (14 minutes), cross-validation (14 minutes). (Note that they use the terminology "validation set" instead of "test set".)
- Alternatively, read section 5.1 (12 pages) of An Introduction to Statistical Learning, which covers the same content as the videos.
This video from Caltech's machine learning course presents an excellent, simple example of the bias-variance tradeoff (15 minutes) that may help you to visualize bias and variance.

Class 7: Linear Regression

Linear regression (notebook, notebook code)
Yelp Lab here with the Yelp reviews data. It is not required but your next homework will involve this dataset so it would be helpful to take a look now!

Homework:

Watch these videos on probability and odds (8 minutes) if you're not familiar with either of those terms.
Read these excellent articles from BetterExplained: An Intuitive Guide To Exponential Functions & e and Demystifying the Natural Logarithm (ln).

Resources:

Setosa has an excellent interactive visualization of linear regression.
To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning, from which this lesson was adapted. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression.
This introduction to linear regression is much more detailed and mathematically thorough, and includes lots of good advice.
This is a relatively quick post on the assumptions of linear regression.
John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
A major scientific journal recently banned the use of p-values:
- Scientific American has a nice summary of the ban.
- This response to the ban in Nature argues that "decisions that are made earlier in data analysis have a much greater impact on results".
- Andrew Gelman has a readable paper in which he argues that "it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough".
An article on "P Hacking" the idea that you can alter data in order to achieve good p values

Class 8: Logistic Regression

Logistic regression (notebook, notebook code)
Confusion matrix (slides)
Exercise with Titanic data (instructions, solution)

Homework:

If you aren't yet comfortable with all of the confusion matrix terminology, watch Rahul Patwari's videos on Intuitive sensitivity and specificity (9 minutes) and The tradeoff between sensitivity and specificity (13 minutes).

Resources:

To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
This simple guide to confusion matrix terminology may be useful to you as a reference.

Class 9: Working on a Data Problem

Today we will work on a real world data problem! We will have 3 options.
Option 1: (stocks) Use stock data from over 7 months of a fictional company ZYX including twitter sentiment, volume and stock price. Our goal is to create a predictive model that predicts forward returns. data here
- Project overview (slides)
  - Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you need...
Option 2: Using ingredients to predict the type of recipe (Kaggle)[https://www.kaggle.com/c/whats-cooking]
Option 3: San Francisco Crime Classification (Kaggle)[https://www.kaggle.com/c/sf-crime]

Class 10: Clustering and Visualization

The slides today will focus on our first look at unsupervised learning, K-Means Clustering!
The code for today focuses on two main examples:
- We will investigate simple clustering using the iris data set.
- We will take a look at a harder example, using Pandora songs as data. See data. See code here
- Checking out some of the limitations of K-Means Clutering here

Homework:

Project Milestone 2 is due in one week!
Download all of the NLTK collections.
- In Python, use the following commands to bring up the download menu.
- import nltk
- nltk.download()
- Choose "all".
- Alternatively, just type nltk.download('all')
Install two new packages: textblob and lda.
- Open a terminal or command prompt.
- Type pip install textblob and pip install lda.

Resources:

Introduction to Data Mining has a nice chapter on cluster analysis.
The scikit-learn user guide has a nice section on clustering.

##Class 11: Natural Language Processing

Agenda

Naural Language Processing is the science of turning words and sentences into data and numbers. Today we will be exploring techniques into this field
code showing topics in NLP
lab analyzing tweets about the stock market

Homework:

Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Wednesday. Here are some questions to think about while you read:
- Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
- Before he tried the "statistical approach" to spam filtering, what was his approach?
- How exactly does his statistical filtering system work?
- What did Paul say were some of the benefits of the statistical approach?
- How good was his prediction of the "spam of the future"?
Below are the foundational topics upon which Wednesday's class will depend. Please review these materials before class:
- Confusion matrix: a good guide roughly mirrors the lecture from class 10.
- Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
- Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
You should definitely be working on your project! First draft is due Monday!!

##Class 12: Naive Bayes Classifier

Today we are going over advanced metrics for classification models and learning a brand new classification model called naive bayes!

Agenda

Learn about ROC/AUC curves
- Slides here
- Code here
Learn the Naive Bayes Classifier
- Slides here
- Code here
- In the code file above we will create our own spam classifier!

Resources

Video on ROC Curves (12 minutes).
My buddy's blog post about the ROC video includes the complete transcript and screenshots, in case you learn better by reading instead of watching.

##Class 13: Decision Trees

We will look into a slightly more complex model today, the Decision Tree.

Agenda

Slides here
Notebook(here
Some more Code here

Homework

Project reviews due next Wednesday!

Resources

Chapter 8.1 of An Introduction to Statistical Learning also covers the basics of Classification and Regression Trees
The scikit-learn documentation has a nice summary of the strengths and weaknesses of Trees.
For those of you with background in javascript, d3.js has a nice tree layout that would make more presentable tree diagrams:
- Here is a link to a static version, as well as a link to a dynamic version with collapsable nodes.
- If this is something you are interested in, Gary Sieling wrote a nice function in python to take the output of a scikit-learn tree and convert into json format.
- If you are intersted in learning d3.js, this a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes an R code walkthrough

Class 14: Ensembling

Ensembling (IPython notebook)
BONUS: Regularization (IPython notebook)
- Bonus advanced sklearn code here

Resources:

scikit-learn documentation: Ensemble Methods
Quora: How do random forests work in layman's terms?

Class 15: Dimension Reduction

Fun with indeed here
PCA
- Slides
- Code: PCA and SVD
- Code: image compression with PCA (original source)

Resources

Some hardcore math in python here
PCA using the iris data set here and with 2 components here
PCA step by step here
Check out Pyxley

Class 16: Neural Networks and SVM

Agenda

slides here
code here
Lab here

Resources

An intro to Neural Networks here
An intro to SVM
SVM Margins Example here
SVM digits was adapted from here
Google Deep Dream: why does it always see dogs?!
Deep Dream Generator
Most used non-sklearn ANN PyBrain
Step by Step back propagation here

Class 18: Recommendation Engines

Recommendation Engines slides
Recommendation Engine Example code

Resources:

The Netflix Prize
Why Netflix never implemented the winning solution
Visualization of the Music Genome Project
The People Inside Your Machine (23 minutes) is a Planet Money podcast episode about how Amazon Mechanical Turks can assist with recommendation engines (and machine learning in general).
Next class we will have a talk from three engineers from OpenGov discussing NLP tactics used for governments around the world!

Class 19: More Neural Networks

We will need a new package! sudo pip install pybrain
Recap here
Let's build our own! here
Let's use Pybrain! here
A talk from OpenGov

Resources

Code adapted from here and here
Calculus adapted from here
Sklearn will come out with their own supervised neural network soon! here

Class 20: Databases and SQL

Slides here
Code here
Lab here with solutions here

Resources

This GA notebook provides a shorter introduction to databases and SQL that helpfully contrasts SQL queries with Pandas syntax.
SQLZOO, Mode Analytics, Khan Academy, Codecademy, Datamonkey, and Code School all have online beginner SQL tutorials that look promising. Code School also offers an advanced tutorial, though it's not free.
w3schools has a sample database that allows you to practice SQL from your browser. Similarly, Kaggle allows you to query a large SQLite database of Reddit Comments using their online "Scripts" application.
What Every Data Scientist Needs to Know about SQL is a brief series of posts about SQL basics, and Introduction to SQL for Data Scientists is a paper with similar goals.
10 Easy Steps to a Complete Understanding of SQL is a good article for those who have some SQL experience and want to understand it at a deeper level.
SQLite's article on Query Planning explains how SQL queries "work".
A Comparison Of Relational Database Management Systems gives the pros and cons of SQLite, MySQL, and PostgreSQL.
If you want to go deeper into databases and SQL, Stanford has a well-respected series of 14 mini-courses.
Blaze is a Python package enabling you to use Pandas-like syntax to query data living in a variety of data storage systems.

Next Steps

The hardest thing to do now is to stay sharp! I have a few recommendations on next steps in order to make sure that you don't forget what we learned here!

Always stay up to date on Kaggle
- Try working with some other people in this class!
- Our Slack channel will stay around if you still want to post cool blogs, videos, etc!
Try implementing some of the models we learned in class on your own!
- Great book [Data Science From Scratch] (http://file.allitebooks.com/20150707/Data%20Science%20from%20Scratch-%20First%20Principles%20with%20Python.pdf) with code
- Text classification with Naive Bayes from Scratch [here] (https://web.stanford.edu/class/cs124/lec/naivebayes.pdf)
- Introduction to Statistical Learning book Videos here
- PCA by hand here
Take a look at the Resources for each class to get a deeper understanding of what we've learned. Trust me there are a lot
Follow data scientists on Twitter. This will help you stay up on the latest news/models/applications/tools.
Read blogs to keep learning. I really like District Data Labs and Data Elixer.
There are some active Python Data meetups in the area:
- SF Python
- SF Data Science
- SF Data Mining
- Request sponsorship for study groups through GA
- General GA Alumni Perks

Thank you all for such a wonderful time and I truly hope to stay in touch.

ovlas26/SF_DAT_17

SF DAT 17 Course Repository

Installation and Setup

Resources

Class 1: Introduction / Expectations / Intro to Data Science

Class 2: Git / Python

Resources:

Class 3: Pandas

Class 4 - More Pandas

Agenda

Class 5 - Intro to ML / Numpy / KNN

Class 6: scikit-learn, Model Evaluation Procedures

Class 7: Linear Regression

Class 8: Logistic Regression

Class 9: Working on a Data Problem

Class 10: Clustering and Visualization

Class 14: Ensembling

Class 15: Dimension Reduction

Class 16: Neural Networks and SVM

Class 18: Recommendation Engines

Class 19: More Neural Networks

Class 20: Databases and SQL

Next Steps