Data Science Course: Lectures and Materials
Issues: For questions, answers and discussions:
Viewing Your and Other Student Work:
IPython Notebook Viewer for this class's student repo
Git Workflow and Command Line Tips:
Class Meetings
Introduction to Data Science
Monday, 3/31/14
Class Materials
Data Collection and Extraction
Wednesday, 4/7/14
Project 1 Introduced
Class Materials
Additional Resources:
##### Learning how to use the file pager, less
Python Documentation
Handy to have this in your bookmarks!
python
introductions
Couple extra handy
Beautiful Soup Tutorials
APIs to play with
Numpy
Wednesday 4/9/2014
Class Materials
Additional Resources
- Watch the 5 minute "IPython Notebook Tour"
- Review "What is NumPy"
- Watch Wes McKinney's 10 minute Whirlwind Tour of Pandas (even once is ok ;-) )
- Another great resource: review Chapters 1 to 5 of Julia Evans's Pandas Cookbook
Pandas
Monday 4/14/2014
Class Materials
Data Visualization and MatPlotLib
Wednesday 4/16/2014
Class Materials
Lecture Notes: Data Visualization
Python Notebook: Plotting with Matplotlib
Assignments Due
- Complete and submit previous assignments
Additional Resources
Resource | About |
---|---|
Basic Plotting in Pandas | |
Matplotlib userguide | |
Matplotlib Gallery | Examples with Code |
Rougier and Prace EuroSciPy Matplotlib Tutorial | Short Overview |
Exploratory Data Analysis
Monday 4/21/2014
We'll be reviewing a number of datasets and going through the data exploration process.
The ACES model for Data Exploration:
Letter | Step | Notes |
---|---|---|
A | Acquire the data and Assemble the data frame | Find data, import into Pandas |
C | Clean the data frame | Identify and limit columns, rows, indices, dates, etc. |
E | Explore global properties | Visualize! Basic plots and stats appropriate to the data set |
S | Subset comparisons | Look at (visualize!) initial emergent variable relationships and subsets |
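The four ACES steps above can be sketched in pandas as follows. This is only an illustration under stated assumptions: the inline CSV, its column names, and the choice of statistics are all hypothetical and stand in for whatever dataset is being explored.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a real file or API response.
RAW_CSV = """date,city,temp_f,notes
2014-04-01,NYC,51,clear
2014-04-02,NYC,,rain
2014-04-01,Boston,47,clear
2014-04-02,Boston,44,clouds
"""

# A: acquire the data and assemble the data frame.
df = pd.read_csv(io.StringIO(RAW_CSV))

# C: clean the frame: parse dates, drop an unused column,
# and drop rows that are missing the measurement.
df["date"] = pd.to_datetime(df["date"])
df = df.drop(columns=["notes"]).dropna(subset=["temp_f"])

# E: explore global properties with summary statistics.
summary = df["temp_f"].describe()

# S: compare subsets, here the mean temperature per city.
by_city = df.groupby("city")["temp_f"].mean()

print(summary)
print(by_city)
```

In a notebook, the E and S steps would usually end in a plot (for example `by_city.plot(kind="bar")`) rather than a printout.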
Class Materials
Resources
- EDA with SAT Scores
- Grouping with Pandas
- Data Wrangling Movies
- EDA Questions
- Volinsky EDA Presentation
Assignments Due
N/A - Please review all prior materials and work on Project 1.
Presentations, Machine Learning, and Data Science Careers
Wednesday 4/23/2014
Assignments Due
[Project 1: Scraping, APIs, and Data Visualization](https://github.com/datadave/GADS9-NYC-Spring2014-Lectures/blob/master/projects/project01.md)
Class Outline
- Selected Presentations of Student Projects
- Discussion of Data Science Careers
- Introduction to Machine Learning
Linear Regressions
Monday 4/28/2014
We'll discuss the linear regression algorithm and learn how to score regression models.
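As a rough sketch of what fitting and scoring a linear regression involves, the example below solves an ordinary least-squares problem with NumPy and reports RMSE and R². The data is synthetic (an assumption, not the course data), and NumPy's `lstsq` is used here only to keep the example dependency-light; the reading list for this class points to scikit-learn for the same task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption, not the course data): y = 3x + 2 plus noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

# Design matrix with an intercept column; solve the least-squares problem.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Score the fit on the same data it was trained on.
pred = X @ coef
residuals = y - pred
rmse = np.sqrt(np.mean(residuals ** 2))
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print(f"slope={slope:.2f} intercept={intercept:.2f} rmse={rmse:.2f} r2={r2:.3f}")
```

Scores computed on the training data tend to be optimistic; holding out part of the data for scoring gives a fairer picture of how the model generalizes.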
Class Materials
Assignments Due
Please submit three optimized models using the `data/day.csv` file, in an IPython notebook or Python script, one for each y variable: `casual`, `registered`, and `cnt`. Please put this in a `lab_submissions/lab07/yourname` folder.
More Reading
Resource | About |
---|---|
Regressions with Sklearn | |
Overfitting Regressions | |
Guide to Logistic Regression | |
Khan Academy Algebra Review | |
MIT OCW | |
Naive Bayes
Wednesday 4/30/2014
We'll be reviewing some basics of probability, developing ways to work with text data, and using a classification algorithm to classify text.
Objectives
- Articulate Naive Bayes' advantages, flaws, applications and theoretical foundation
- Explain how Naive Bayes is applied to classify text or Spam
- Be familiar with using the N.B. classifiers in NLTK and SKLearn
- Create a basic Naive Bayes classifier
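For the last objective, here is a minimal sketch of a multinomial Naive Bayes text classifier written from scratch with the standard library. It is not the NLTK or scikit-learn implementation used in the class notebooks; the function names `train` and `predict` and the tiny training set are hypothetical, chosen only to show the log prior, add-one smoothing, and the independence assumption at work.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (text, label) pairs; returns the counts used to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior for the given text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Tiny made-up training set (hypothetical).
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap prize offer click now", "spam"),
    ("lunch meeting moved to noon", "ham"),
    ("notes from the project meeting", "ham"),
]

model = train(TRAINING)
print(predict("claim your cash prize", model))
print(predict("project meeting at noon", model))
```

Summing log probabilities instead of multiplying raw probabilities keeps the scores from underflowing on longer documents.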
Materials
- NB_Gender_Names_NLTK: Notebook covering basics of Naive Bayes with single features
- NB_Biebama_NLTK: Demo: Classifying text as Obama or Bieber
- NB_Movies_SKLearn: Illustration of SK Learn NB functions
- NB_Movies_NTLK: Illustration of NB on text with NLTK
Assignments
- Add a feature to the NLTK gender classifier to try and improve performance
- Create a classifier to tell the difference between two authors
- Brainstorm classification topics for projects (due May 14)
Follow Up Notes
Based on student feedback:
Classifier Comparison and Logistic Regression
Wednesday 5/7/2014
Objectives
- Understand how to apply logistic regression to a classification problem
- Create a two-dimensional feature space to evaluate the performance of classifiers
- Leverage the interoperability of SKLearn classifiers to compare KNN, Naive Bayes, Decision Trees and Logistic Regression on a single classification problem
Materials
The lesson notebook provides:
- A brief background on logistic classification
- A mesh function using `np.meshgrid` to evaluate the predictive functions on a two-dimensional feature grid
The intention is to provide a starting template with which to contrast various classifiers on a clean, real-world data set.
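A minimal sketch of what such a mesh function might look like is shown below. The names `evaluate_on_grid` and `toy_predict` are hypothetical and are not taken from the lesson notebook; `toy_predict` stands in for the `predict` method of any fitted classifier that accepts an `(n, 2)` array.

```python
import numpy as np


def evaluate_on_grid(predict, x_range, y_range, step=0.1):
    """Evaluate predict on every point of a 2-D grid.

    predict takes an (n, 2) array of points and returns n labels.
    """
    xs = np.arange(x_range[0], x_range[1], step)
    ys = np.arange(y_range[0], y_range[1], step)
    xx, yy = np.meshgrid(xs, ys)
    # Flatten the grid into a list of (x, y) points, predict, then restore the shape.
    grid_points = np.column_stack([xx.ravel(), yy.ravel()])
    zz = predict(grid_points).reshape(xx.shape)
    return xx, yy, zz


def toy_predict(points):
    # Hypothetical stand-in for a fitted classifier:
    # label 1 where x + y > 1, otherwise 0.
    return (points[:, 0] + points[:, 1] > 1.0).astype(int)


xx, yy, zz = evaluate_on_grid(toy_predict, (0.0, 2.0), (0.0, 2.0), step=0.5)
print(zz)
```

Passing `xx`, `yy`, and `zz` to Matplotlib's `plt.contourf` draws the decision regions, which makes it easy to compare classifiers side by side on the same feature space.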
Assignments
Students are expected to add additional classifiers to the notebook, experiment with parameters, and develop conclusions about the differences between classifier performance on the given sample data.
Your Centroid or Mine? An Introduction to K-Means
Monday, May 19th
Humans are good, often too good, at clustering, and it's another realm of our intelligence that we can programmatically apply to machines. Toddlers can tell that objects are boats, flags, and doggies -- but how?
Machine clustering is used to categorize the web, understand galaxies, organize genetic data, segment customers, classify mental illness, and detect disease patterns, to name just a few applications.
K-Means is a very simple clustering algorithm that works well in practice and is one of the most widely used.
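To make the idea concrete, here is a minimal sketch of the basic K-Means loop (often called Lloyd's algorithm) in NumPy: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The function name, the random initialization, and the two synthetic blobs are assumptions for illustration; library implementations add smarter initialization and multiple restarts.

```python
import numpy as np


def kmeans(points, k, n_iters=100, seed=0):
    """Cluster points (an (n, d) array) into k groups; returns centroids and labels."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels


# Two synthetic, well-separated blobs (hypothetical data).
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(data, k=2)
print(centroids)
```

On data this cleanly separated the result does not depend much on the starting centroids; on messier data, different seeds can give different clusterings, which is one of the complications the readings below discuss.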
Here are some resources to get started:
Recommended Resources
Title | Author | Type | Length | Difficulty | Description | Rating (1 to 4 Stars) |
---|---|---|---|---|---|---|
Cluster Analysis and K-Means | Kumar, UMN | PDF Excerpt | 40 pages | Intermediate | Good chapter overview on clustering and then section "8.2.1 Basic K-Means Algorithm" gives great K-Means summary. The rest of 8.2 includes complications with K-Means and concludes with the optimization math. If you just want the minimum, read pgs. 498 to 505 | ++++ |
Clustering Overview | StanfordML | html page | 3 pages | Intermediate | Good, quick overview of everything | ++++ |
K-Means Clustering | Mathematical Monk | Video | 15 minutes | Novice | Good Khan-style overview of math | +++ |
K-Means Wikipedia Entry | Everyone | Wikipedia | 6 pages | Intermediate | Includes Iris and 'mickey mouse' we'll be looking at. | ++ |
Review of Random Forests and the Ensemble Learning Approach
Wednesday, May 21st
We'll wrap up all machine learning material with a more detailed discussion of the differences between bagging (bootstrap aggregating), boosting, and random forests.
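As a small illustration of the bagging idea, the sketch below trains many one-threshold "decision stumps" on bootstrap resamples of a noisy 1-D dataset and combines them by majority vote. All names and the synthetic data are hypothetical. A random forest extends this by bagging full decision trees and also sampling a random subset of features at each split, while boosting instead fits models one after another, each focusing on the errors of the previous ones.

```python
import numpy as np


def fit_stump(x, y):
    """Find the threshold t maximizing accuracy of the rule: predict 1 if x > t."""
    best_t, best_acc = x[0], -1.0
    for t in np.unique(x):
        acc = np.mean((x > t).astype(int) == y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def bagged_stumps(x, y, n_models=25, seed=0):
    """Fit one stump per bootstrap resample and return all thresholds."""
    rng = np.random.default_rng(seed)
    thresholds = []
    for _ in range(n_models):
        # Bootstrap: sample len(x) indices with replacement.
        idx = rng.integers(0, len(x), size=len(x))
        thresholds.append(fit_stump(x[idx], y[idx]))
    return np.array(thresholds)


def predict(thresholds, x):
    # Each stump votes; the ensemble predicts the majority class.
    votes = (x[:, None] > thresholds[None, :]).astype(int)
    return (votes.mean(axis=1) > 0.5).astype(int)


# Synthetic data (an assumption): the true class is 1 when x > 5,
# with roughly 10% of the training labels flipped as noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y_true = (x > 5).astype(int)
flip = rng.random(200) < 0.1
y_noisy = np.where(flip, 1 - y_true, y_true)

thresholds = bagged_stumps(x, y_noisy)
accuracy = np.mean(predict(thresholds, x) == y_true)
print(f"ensemble accuracy against the clean labels: {accuracy:.2f}")
```

Because each stump sees a slightly different resample, the individual thresholds vary, and voting averages out some of that variance.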
Recommended Resources
Title | Author | Type | Length |
---|---|---|---|
Ensemble Learning | Wikipedia | Article | --- |
sklearn doc | scikit-learn | Documentation | --- |
yhat Blog on Random Forests | yHat | blog article | --- |
Ensemble Methods in Machine Learning | Dietterich, Thomas | PDF Journal | 15 pages |
A Few Useful Things to Know about Machine Learning | Domingos, Pedro | PDF Journal | 9 pages |
Ensemble Methods | Hyer, Jay | Presentation | 31 Slides |
Kaggle Random Forests | Kaggle | Kaggle | --- |
Updates expected -- See Lesson Folder for further details