
SF DAT 22 Course Repository

Course materials for General Assembly's Data Science course in San Francisco, CA (3/29/16 - 6/9/16).

Instructors: Sinan Ozdemir

Teaching Assistants: Mars Williams / Imeh Williams

Office hours:

W: 5:30pm - 7:30pm

Sa: 12pm-2pm

Su: 12pm-2pm

All will be held in the student center at GA, 225 Bush Street

Course Project Information

Course Project Examples

| Tuesday | Thursday | Project Milestone | HW |
| --- | --- | --- | --- |
| 3/29: Introduction / Expectations / Intro to Data Science | 3/31: Introduction to Git / Pandas | | |
| 4/5: Pandas | 4/7: APIs / Web Scraping 101 | | HW 1 Assigned (Th) |
| 4/12: Intro to Machine Learning / KNN | 4/14: Scikit-learn / Model Evaluation | Question and Data Set (Th) | HW 1 Due (Th) |
| 4/19: Linear Regression | 4/21: Logistic Regression | | |
| 4/26: Time Series Data | 4/28: Working on a Data Problem | | HW 2 Assigned (Th) |
| 5/3: Clustering | 5/5: Natural Language Processing | | HW 2 Due (Th) |
| 5/10: Naive Bayes | 5/12: Decision Trees | One Pager Due (Th) | |
| 5/17: Ensembling Techniques | 5/19: Dimension Reduction | Peer Review Due (Th) | HW 3 Assigned (Th) |
| 5/24: Support Vector Machines | 5/26: Web Development with Flask | | HW 3 Due (Th) |
| 5/31: Recommendation Engines | 6/2: Neural Networks | | |
| 6/7: Projects | 6/9: Projects | Git Er Done | Git Er Done |

Installation and Setup

  • Install the Anaconda distribution of Python 2.7.x.
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "SF_DAT_17 team" and add your photo!

Resources

Class 1: Introduction / Expectations / Intro to Data Science / Python Exercises

####Agenda

  • Introduction to General Assembly slides
  • Course overview: our philosophy and expectations (slides)
  • Intro to Data Science: slides

Break -- Command Line Tutorial

  • Tutorial on how to read and write IPython notebooks
  • Python pre-work here
  • Next class we will go over proper use of Git and IPython notebooks in more depth

####Homework

  • Make sure you have everything installed as specified above in "Installation and Setup" by Thursday
  • Read this awesome intro to Git here
  • Read this intro to the IPython notebook here

--

Class 2: Introduction to Git / Pandas

####Agenda

  • Introduction to Git
  • Intro to Pandas walkthrough here
    • Pandas is an excellent tool for exploratory data analysis
    • It allows us to easily manipulate, graph, and visualize basic statistics and elements of our data
    • Pandas Lab!
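
To make the exploratory-analysis point above concrete, here is a minimal sketch of the kind of first look pandas makes easy. It is not taken from the class walkthrough; the toy data and the column names `city` and `price` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "city": ["SF", "SF", "NYC", "NYC", "LA"],
    "price": [10.0, 14.0, 8.0, 12.0, 9.0],
})

# Basic summary statistics (count, mean, std, min, quartiles, max).
summary = df["price"].describe()

# Mean price per city, a typical first exploratory step.
mean_by_city = df.groupby("city")["price"].mean()

print(summary)
print(mean_by_city)
```

From here, `df.plot()` or `df.hist()` would be a natural next step for the graphing side, assuming matplotlib is installed (it ships with Anaconda).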

####Homework

  • Go through the Python file and finish any exercises you weren't able to complete in class
  • Make sure you have all of the repos cloned and ready to go
    • You should have both "sfdat22" and "sfdat22_work"
  • Read Greg Reda's Intro to Pandas
  • Take a look at Kaggle's Titanic competition

Resources:

  • Another Git tutorial here
  • In-depth Git/GitHub tutorial series made by a GA_DC Data Science Instructor here
  • Another Intro to Pandas (written by Wes McKinney and adapted from his book)
    • Here is a video of Wes McKinney going through his IPython notebook!

--

Class 3: Pandas

####Agenda

  • Don't forget to git pull in the sfdat22 repo in your command line
  • Intro to Pandas walkthrough here (same as last Thursday's)
  • Extended Intro to Pandas walkthrough here (new)

####Homework

  • Finish any lab questions that you did not finish in class
    • Make sure everything is pushed to sfdat22_work if you'd like us to take a look
  • Make sure both requests and beautifulsoup are installed
    • To check, confirm that import requests and import bs4 both run without error in Python
  • Read this intro to APIs
  • Check out the National UFO Reporting Center here; it will be one of the topics of the lab on Thursday
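
The installation check mentioned in the homework can also be scripted. The sketch below is one possible approach, written for Python 3 (the course itself targets Python 2.7.x). The helper name `is_installed` is hypothetical. It uses the standard library's `importlib.util.find_spec`, which looks a module up without importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# beautifulsoup is imported under the name bs4.
for name in ["requests", "bs4"]:
    status = "installed" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If either line reports MISSING, install the package with conda or pip before Thursday's lab.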

Resources:

--

Class 4: APIs / Web Scraping 101

####Agenda

  • I will also be using a module called tweepy today.
    • To install it, type conda install tweepy into your console
      • If that does not work, try pip install tweepy
  • Slides on Getting Data here
  • Intro to Regular Expressions here
  • Getting Data from the open web here
  • Getting Data from an API here
  • LAB on getting data here

####Homework

  • The first homework will be assigned by tomorrow morning (in a homework folder) and it is due NEXT Thursday (4/14)
    • It is a combination of pandas questions with a bit of API/scraping
    • Please push your completed work to your sfdat22_work repo for grading

####Resources:

--

Class 5: Intro to Machine Learning / KNN

####Agenda

  • Iris pre-work code

    • Using numpy to investigate the iris dataset further
    • Understanding how humans learn so that we can teach the machine!
  • Intro to numpy code

    • Numerical Python, code adapted from tutorial here
    • Special attention to the idea of the np.array
  • Intro to Machine Learning and KNN slides

    • Supervised vs Unsupervised Learning
    • Regression vs. Classification
  • Lab to create our own KNN model
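
For reference, a from-scratch KNN classifier can be quite short. The following is a minimal sketch, not the lab solution. It assumes Euclidean distance and a simple majority vote, uses only the standard library (Python 3.8+ for `math.dist`), and runs on made-up toy data. The function name `knn_predict` is hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    ]
    # Keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote; on a tie, the label seen first among the neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters in 2-D.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # prints "a"
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))  # prints "b"
```

This is a classification example: the response is a category. Averaging the neighbors' numeric values instead of voting would turn the same idea into regression.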

####Homework

  • The one page project milestone as well as the pandas homework! See requirements here
  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?

Resources:

--

Class 6: scikit-learn, Model Evaluation Procedures

  • Introduction to scikit-learn with iris data (code)
  • Exploring the scikit-learn documentation: user guide, module reference, class documentation
  • Discuss the article on the bias-variance tradeoff
  • Look at some code on the bias-variance tradeoff
    • To run this, I use a module called "seaborn"
    • To install it, open your terminal (Git Bash) in any directory and type in sudo pip install seaborn
  • Model evaluation procedures (slides, code)
  • Glass Identification Lab here
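
As a rough illustration of one model evaluation procedure, the sketch below holds out part of the iris data as a test set and compares training accuracy with testing accuracy. It is not the class code. The split size, `random_state`, and `n_neighbors=5` are arbitrary assumptions, and it uses the current `sklearn.model_selection` module (older scikit-learn releases exposed `train_test_split` elsewhere).

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features and labels as arrays.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training rows only.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Training accuracy tends to be optimistic; testing accuracy is the
# better estimate of performance on unseen data.
train_acc = accuracy_score(y_train, knn.predict(X_train))
test_acc = accuracy_score(y_test, knn.predict(X_test))
print(train_acc, test_acc)
```

Re-running with different values of `n_neighbors` is one way to see how model complexity affects the gap between the two numbers.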

Homework:

Optional:

  • Practice what we learned in class today! Finish up the Glass data lab

Resources:

--

Class 7: Linear Regression

  • Linear regression (notebook)

    • In depth slides here
  • LAB -- Yelp dataset lab here, using the Yelp reviews data. It is not required, but your next homework will involve this dataset, so it would be helpful to take a look now!
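
For intuition about what a linear regression fit computes, here is a minimal sketch of simple (one-feature) least squares in plain Python. It is not taken from the course notebook. The function name `fit_line` and the toy data are made up, and in practice you would likely use scikit-learn or statsmodels instead.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # prints 2.0 1.0
```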

Homework:

Resources:

--

Class 8: Logistic Regression

  • Logistic regression (notebook)
    • BONUS slides here (These slides go a bit deeper into the math)
  • Confusion matrix (slides)
  • LAB -- Exercise with Titanic data instructions
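
To make the confusion matrix concrete, the sketch below counts the four cells for a binary classifier and derives accuracy, precision, and recall from them. It is an illustration in plain Python with made-up labels, not the class material. The function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])

print(counts)  # prints {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
print(accuracy, precision, recall)  # prints 0.75 0.75 0.75
```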

Homework:

Resources: