Course materials for General Assembly's Data Science course in Washington, DC (12/15/14 - 3/16/15). View student work in the student repository.
Instructors: Sinan Ozdemir and Kevin Markham. Teaching Assistant: Brandon Burroughs.
Office hours: 1-3pm on Saturday and Sunday (Starbucks at 15th & K), 5:15-6:30pm on Monday (GA)
Monday | Wednesday |
---|---|
12/15: Introduction | 12/17: Python |
12/22: Getting Data | 12/24: No Class |
12/29: No Class | 12/31: No Class |
1/5: Git and GitHub | 1/7: Pandas Milestone: Question and Data Set |
1/12: Numpy, Machine Learning, KNN | 1/14: scikit-learn, Model Evaluation Procedures |
1/19: No Class | 1/21: Linear Regression |
1/26: Logistic Regression, Preview of Other Models | 1/28: Model Evaluation Metrics Milestone: Data Exploration and Analysis Plan |
2/2: Working a Data Problem | 2/4: Clustering and Visualization Milestone: Deadline for Topic Changes |
2/9: Naive Bayes | 2/11: Natural Language Processing |
2/16: No Class | 2/18: Decision Trees and Ensembles Milestone: First Draft |
2/23: Advanced scikit-learn | 2/25: Databases and MapReduce |
3/2: Recommenders | 3/4: Course Review, Companion Tools Milestone: Second Draft (Optional) |
3/9: TBD | 3/11: Project Presentations |
3/16: Project Presentations |
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "DAT4 team" and add your photo!
- Introduction to General Assembly
- Course overview: our philosophy and expectations (slides)
- Data science overview (slides)
- Tools: check for proper setup of Anaconda, overview of Slack
Homework:
- Resolve any installation issues before next class.
Optional:
- Review the code from Saturday's Python refresher for a recap of some Python basics.
- Read Analyzing the Analyzers for a useful look at the different types of data scientists.
- Subscribe to the Data Community DC newsletter or check out their event calendar to become acquainted with the local data community.
- Brief overview of Python environments: Python interpreter, IPython interpreter, Spyder
- Python quiz (solution)
- Working with data in Python
- Obtain data from a public data source
- FiveThirtyEight alcohol data, and revised data (continent column added)
- Reading and writing files in Python (code)
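As a rough illustration of the file reading and writing topic above, here is a minimal sketch using only the standard library. The file name `drinks_sample.csv` and its rows are invented for this example (they are not the FiveThirtyEight data), and the code assumes Python 3 even though the course itself used Python 2.7.

```python
import csv
import os
import tempfile

# Hypothetical rows, invented for illustration only.
rows = [
    ["country", "beer_servings", "continent"],
    ["A", "10", "EU"],
    ["B", "25", "AS"],
]

# Write into a temporary directory so the example does not touch your own files.
path = os.path.join(tempfile.mkdtemp(), "drinks_sample.csv")

# The "with" statement closes the file automatically, even if an error occurs.
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read the file back; DictReader uses the first row as the column names.
with open(path, newline="") as f:
    records = list(csv.DictReader(f))

# Values read from a CSV file are strings, so convert before doing arithmetic.
total_beer = sum(int(r["beer_servings"]) for r in records)
print(records[0]["country"], total_beer)
```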
Homework:
- Python exercise (solution)
- Read through the project page in detail.
- Review a few projects from past Data Science courses to get a sense of the variety and scope of student projects.
- Check for proper setup of Git by running `git clone https://github.com/justmarkham/DAT-project-examples.git`
Optional:
- If you need more practice with Python, review the "Python Overview" section of A Crash Course in Python, work through some of Codecademy's Python course, or work through Google's Python Class and its exercises.
- For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford.
Resources:
- Online Python Tutor is useful for visualizing (and debugging) your code.
- Checking your homework
- Regular expressions, web scraping, APIs (slides, regex code, web scraping and API code)
- Any questions about the course project?
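To give a feel for the regular expressions topic above, here is a small sketch using Python's built-in `re` module. The sample text and the date pattern are assumptions made up for this example, not taken from the class code.

```python
import re

# Hypothetical text, invented for illustration.
text = "Class dates: 12/15/14, 12/17/14 and 1/5/15. Contact: someone@example.com"

# One or two digits, a slash, one or two digits, a slash, then two digits.
# The parentheses create groups so month, day, and year can be pulled apart.
date_pattern = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2})\b")

dates = date_pattern.findall(text)   # list of (month, day, year) tuples
first = date_pattern.search(text)    # match object for the first date, or None
months = sorted({int(month) for month, _, _ in dates})

print(dates)
print(first.group(0))
print(months)
```

Pasting the same pattern and text into a regex tester is a quick way to see which part of the pattern matches which part of the text.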
Homework:
- Think about your project question, and start looking for data that will help you to answer your question.
- Prepare for our next class on Git and GitHub:
- You'll need to know some command line basics, so please work through GA's excellent command line tutorial and then take this brief quiz.
- Check for proper setup of Git by running `git clone https://github.com/justmarkham/DAT-project-examples.git`. If that doesn't work, you probably need to install Git.
- Create a GitHub account. (You don't need to download anything from GitHub.)
Optional:
- If you aren't feeling comfortable with the Python we've done so far, keep practicing using the resources above!
Resources:
- regex101 is an excellent tool for testing your regular expressions. For learning more regular expressions, Google's Python Class includes an excellent regex lesson (which includes a video).
- Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
- Special guest: Nick DePrey presenting his class project from DAT2
- Git and GitHub (slides)
Homework:
- Project milestone: Submit your question and data set to your folder in DAT4-students before class on Wednesday! (This is a great opportunity to practice writing Markdown and creating a pull request.)
Optional:
- Clone this repo (DAT4) for easy access to the course files.
Resources:
- Read the first two chapters of Pro Git to gain a much deeper understanding of version control and basic Git commands.
- GitRef is an excellent reference guide for Git commands.
- Git quick reference for beginners is a shorter reference guide with commands grouped by workflow.
- The Markdown Cheatsheet covers standard Markdown and a bit of "GitHub Flavored Markdown."
- Pandas for data exploration, analysis, and visualization (code)
- Split-Apply-Combine pattern
- Simple examples of joins in Pandas
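The split-apply-combine pattern and joins listed above can be sketched roughly as follows. The two small DataFrames and their column names are hypothetical, invented for this example, and the code assumes a reasonably recent version of Pandas.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["EU", "EU", "AS", "AS"],
    "beer": [100, 200, 30, 50],
})
populations = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population_millions": [10, 20, 30],
})

# Split the rows by continent, apply the mean to each group,
# and combine the results into a single Series indexed by continent.
mean_beer = drinks.groupby("continent")["beer"].mean()

# An inner join keeps only countries present in both tables;
# a left join keeps every row of `drinks` and fills gaps with NaN.
inner = pd.merge(drinks, populations, on="country", how="inner")
left = pd.merge(drinks, populations, on="country", how="left")

print(mean_beer)
print(len(inner), len(left))
```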
Homework:
Optional:
- To learn more Pandas, review this three-part tutorial, or review these three excellent (but extremely long) notebooks on Pandas: introduction, data wrangling, and plotting.
Resources:
- For more on Pandas plotting, read the visualization page from the official Pandas documentation.
- To learn how to customize your plots further, browse through this notebook on matplotlib.
- To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and Columbia's Data Mining class has an excellent slide deck.
- Numpy (code)
- "Human learning" with iris data (code, solution)
- Machine Learning and K-Nearest Neighbors (slides)
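For intuition about K-Nearest Neighbors, here is a from-scratch sketch in plain Python. It is not the class code: the function name `knn_predict` and the toy data are made up for this example, and it assumes Euclidean distance with a simple majority vote.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Compute the Euclidean distance from the query to every training point.
    distances = [
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), label)
        for x, label in zip(train_X, train_y)
    ]
    # Sort by distance only, then keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote; ties go to the label that appears first among the neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature training data, invented for illustration.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # small
print(knn_predict(train_X, train_y, (5.0, 4.8), k=3))  # large
```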
Homework:
- Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
- In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
- In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
- In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
- How does the choice of K affect model bias? How about variance?
- As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
- Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
- Does a high value for K cause over-fitting or under-fitting?
Resources:
- For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
- Introduction to scikit-learn with iris data (code)
- Exploring the scikit-learn documentation: user guide, module reference, class documentation
- Discuss the article on the bias-variance tradeoff
- Model evaluation procedures (slides, code)
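The two evaluation procedures covered here, train/test split and cross-validation, might look roughly like this in scikit-learn. This is a sketch rather than the class code: it assumes a recent scikit-learn where these helpers live in `sklearn.model_selection` (older releases used a different module), and the choices of `n_neighbors=5`, `test_size=0.4`, and `random_state=4` are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Procedure 1: train/test split.
# Fit on one portion of the data and measure accuracy on the held-out portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4
)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)

# Procedure 2: 5-fold cross-validation.
# Each fold takes one turn as the testing set, and the scores are averaged.
cv_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=5, scoring="accuracy"
)

print(test_accuracy)
print(cv_scores.mean())
```

Changing `random_state` changes the train/test split and usually the reported accuracy, which is one reason the averaged cross-validation score tends to be a steadier estimate.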
Homework:
- Keep working on your project. Your data exploration and analysis plan is due in two weeks!
Optional:
- Practice what we learned in class today!
- If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
- If you don't yet have your project data: Pick a suitable dataset from the UCI Machine Learning Repository, try using KNN for classification, and evaluate your model. The Glass Identification Data Set is a good one to start with.
- Either way, you can submit your commented code to DAT4-students, and we'll give you feedback.
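The idea mentioned above of turning a regression problem into a classification problem by converting numerical ranges into categories can be sketched as follows. The cut points, labels, and prices are hypothetical, and `to_category` is a helper name invented for this example.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, invented for illustration:
# below 100000 is "low", 100000 up to (not including) 300000 is "medium",
# and 300000 or more is "high".
cut_points = [100000, 300000]
labels = ["low", "medium", "high"]

def to_category(value):
    """Map a number to the label of the range it falls in."""
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the matching label.
    return labels[bisect_right(cut_points, value)]

prices = [50000, 100000, 250000, 900000]
categories = [to_category(p) for p in prices]
print(categories)  # ['low', 'medium', 'medium', 'high']
```

The resulting categories can then be used as the response for a classifier such as KNN.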
Resources:
- Here's a great 30-second explanation of overfitting.
- For more on today's topics, these videos from Hastie and Tibshirani are useful: overfitting and train/test split (14 minutes), cross-validation (14 minutes). (Note that they use the terminology "validation set" instead of "test set".)
- Alternatively, read section 5.1 (12 pages) of An Introduction to Statistical Learning, which covers the same content as the videos.
- This video from Caltech's machine learning course presents an excellent, simple example of the bias-variance tradeoff (15 minutes) that may help you to visualize bias and variance.
- Linear regression (IPython notebook)
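As a companion to the linear regression lesson, here is a minimal sketch of simple (one-feature) least squares in plain Python. It is not taken from the notebook: the function name and the toy data are assumptions for this example, and in practice a library such as Statsmodels or scikit-learn would do this for you.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_linear_regression(x, y)
prediction = intercept + slope * 5.0
print(intercept, slope, prediction)
```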
Homework:
- Keep working on your project. Your data exploration and analysis plan is due next Wednesday!
Optional:
- Similar to last class, your optional exercise is to practice what we have been learning in class, either on your project data or on another dataset.
Resources:
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning, from which this lesson was adapted. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
- To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression.
- This introduction to linear regression is much more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.