Here you will find the curriculum for the 11-week Data Science course at General Assembly.
Instructor: Anthony Erlinger
Teaching Assistants: Josh Schneier and Deepti Gottipati
Don't submit assignments here! The student repository for submitting assignments can be found at https://github.com/ga-students/DAT_20_Students.
Please use this form when submitting assignments: http://goo.gl/forms/YZWvn9MUlt
Course Description:
This course takes a practical approach to the knowledge and skills required to excel in the field of data science. Through case studies, real-world examples, and guest speakers, students will be exposed to the basics of data science, fundamental modeling techniques, and a variety of tools for making predictions and decisions from data. Students will gain practical computational experience by running machine learning algorithms and learning how to choose the models that best represent their data. Python is used throughout the course.
Prerequisites:
- Some experience with programming languages (preferably R or Python) and familiarity with the UNIX command-line interface.
- A laptop running OS X (Mac) or UNIX/Linux
Students are expected to complete approximately **10 hours of pre-work** before the course begins, as outlined in this [pre-work document](./Assignments/PreWork/ds_pre_work.pdf).
The typical structure of each session is 40% lecture and 60% exercises/labs.

Before the first class, you should:
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "Data Science 20" team and add your photo!
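If you'd like to verify your setup before the first class, the following sanity check is a minimal sketch (it assumes only the packages bundled with Anaconda and is not part of the official pre-work):

```python
# Quick setup check: confirm Python 2.7.x and the core scientific
# packages from the Anaconda distribution are importable.
import sys
import numpy
import pandas
import sklearn

print("Python version: %s" % sys.version.split()[0])  # expect 2.7.x
print("numpy %s, pandas %s, scikit-learn %s" % (
    numpy.__version__, pandas.__version__, sklearn.__version__))
```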
The pre-model workflow: Mining and Representing Data (5 Lectures)
Tuesday | Thursday |
---|---|
(No Class) | Lecture 1 (3/12) Introduction |
Lecture 2 (3/17) Git and Python | Lecture 3 (3/19) Mining data from the web |
Lecture 4 (3/24) Statistics with Pandas and Numpy | Lecture 5 (3/26) Visualizing Data |
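As a taste of what this unit covers, here is a minimal pandas sketch of loading and summarizing a dataset (the filename `data.csv` is a placeholder, not a course dataset):

```python
# Load a CSV file and compute basic summary statistics with pandas.
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder: any CSV with a header row
print(df.head())              # first five rows
print(df.describe())          # count, mean, std, quartiles for numeric columns
```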
Learning from Data: Building Predictive Models (11 Lectures)
Tuesday | Thursday |
---|---|
Lecture 6 (3/31) Machine Learning With SKLearn | Lecture 7 (4/2) Linear Regression |
Lecture 8 (4/7) Polynomial Regression and the Problem of Overfitting | Lecture 9 (4/9) Logistic Regression |
Lecture 10 (4/14) Text Analysis with Naive Bayes and a Brief Intro to Natural Language Processing | Lecture 11 (4/16) Model Evaluation and Cross-Validation Strategies |
Lecture 12 (4/21) Decision Trees | Lecture 13 (4/23) Support Vector Machines and the basics of the kernel space |
Lecture 14 (4/28) PCA and dimensionality reduction | Lecture 15 (4/30) K-means clustering and KNN |
Lecture 16 (5/5) Ensemble Learning and Random Forests | (see next unit) |
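Most of this unit follows scikit-learn's fit/predict pattern. The sketch below is one illustrative example (using the library's built-in iris dataset and a KNN classifier), not an assigned exercise:

```python
# The recurring scikit-learn pattern: instantiate a model, fit it to
# data, then use it to predict. Shown here with KNN on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=5)  # classify by 5 nearest neighbors
knn.fit(X, y)                              # learn from all 150 samples
print(knn.predict(X[:3]))                  # predicted classes for 3 rows
```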
Intro to Data Engineering: Processing Data at Scale (6 Lectures)
Tuesday | Thursday |
---|---|
(see previous unit) | Lecture 17 (5/7) Querying Data in Relational Databases |
Lecture 18 (5/12) Recommender Systems and Network Analysis | Lecture 19 (5/14) Processing Data at Scale Using MapReduce |
Lecture 20 (5/19) Working Session for Final Project | Lecture 21 (5/21) Open Session: Speaker or Course Review |
Lecture 22 (5/26) Project Presentations | |
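For a rough intuition of the MapReduce model covered in Lecture 19, here is a toy word count in plain Python (a conceptual sketch only; real MapReduce jobs distribute these steps across machines):

```python
# Toy word count illustrating the map-reduce idea:
# the "map" step emits (word, 1) pairs, the "reduce" step sums per word.
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit a (key, value) pair for every word occurrence.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/reduce: group pairs by key and sum the values.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # e.g. {'the': 3, 'quick': 2, 'dog': 2, ...}
```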
#### Each lesson in this curriculum contains the following:
- Agenda
- Slides
- In-Class Exercises (which can include code)
- Additional Resources
The final project should represent significant original work applying data science techniques to an interesting problem. Final projects are individual efforts, but you should be talking frequently with your instructors and classmates about them.
Address a data-related problem in your professional field or a field you're interested in. Pick a subject that you're passionate about; if you're strongly interested in the subject matter it'll be more fun for you and you'll produce a better project!
To stimulate your thinking, an extensive list of public data sources is included below. Using public data is the most common choice. If you have access to private data, that's also an option, though you'll have to be careful about what results you can release. You are also welcome to compete in a Kaggle competition as your project, in which case the data will be provided to you.
You should also take a look at past projects from other GA Data Science students, to get a sense of the variety and scope of projects.
You will be assigned to review the project drafts of two of your peers, and you will have one week to provide them with feedback. Upload your feedback as a Markdown (or plain text) document to the "reviews" folder of DAT_20. If your last name is Smith and you are reviewing Jones, name your file smith_reviews_jones.md.
Expectations:
- Read everything they wrote!
- If they provided their data, review it and try to understand it.
- Read their code and try to understand their thought process.
- If their code can be run, try running it.
- Spend at least one hour reviewing their project (including the time it takes to write the feedback).
Your feedback would ideally consist of:
- Strengths of their project (things you particularly like about it)
- Comments about things you think could be improved
- Questions about things you don't understand
- Comments about their code
- Suggestions for next steps
- Guiding principle: Give feedback that would be helpful to you if it were your project!
You should take a quick glance through their project as soon as possible, to make sure you understand what they have given you and what files you should be reviewing. If you're unclear, ask them about it!
You are responsible for creating a project paper and a project presentation. The paper should be written with a technical audience in mind, while the presentation should target a more general audience. You will deliver your presentation (including slides) during the final week of class, though you are also encouraged to present it to other audiences.
Here are the components you should aim to cover in your paper:
- Problem statement and hypothesis
- Description of your data set and how it was obtained
- Description of any pre-processing steps you took
- What you learned from exploring the data, including visualizations
- How you chose which features to use in your analysis
- Details of your modeling process, including how you selected your models and validated them (see the validation sketch below)
- Your challenges and successes
- Possible extensions or business applications of your project
- Conclusions and key learnings
Your presentation should cover these components with less breadth and less depth. Focus on creating an engaging, clear, and informative presentation that tells the story of your project.
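For the validation step, one common approach (covered in Lecture 11) is k-fold cross-validation. The sketch below uses scikit-learn's built-in iris data and logistic regression purely as placeholders; substitute your own model and dataset:

```python
# 5-fold cross-validation: train on 4/5 of the data, score on the
# remaining 1/5, and rotate, averaging the five accuracy scores.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

iris = load_iris()
model = LogisticRegression()

scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```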
You should create a GitHub repository for your project that contains the following:
- Project paper: any format (PDF, Markdown, etc.)
- Presentation slides: any format (PDF, PowerPoint, Google Slides, etc.)
- Code: commented Python scripts, and any other code you used in the project
- Data: data files in "raw" or "processed" format
- Data dictionary (aka "code book"): description of each variable, including units
If it's not possible or practical to include your entire dataset, you should link to your data source and provide a sample of the data. (GitHub has a size limit of 100 MB per file and 1 GB per repository.) If your data is private, you can either include an "anonymized" version of your data or create a private GitHub repository.
See the Resources folder.
- Python
  - Anaconda Python Distribution
  - Learn Python in X Minutes
  - Learn Python the Hard Way
  - Learn Python (interactive)
  - Google's Python Class
  - The Python Tutorial
  - IPython
  - The Python Package Index
  - SciPy
  - NumPy
  - Matplotlib
  - pyvideo.org
- Wolfram Alpha
- Jake Hofman Data Links
- Peter Skomoroch (Linkedin) Data Links
- Hilary Mason (bitly) Data Links
- Wikipedia Database
- IMDB Data
- Last.fm Database
- Quandl
- Datamob
- Factual
- Metro Boston Data Common
- Census.gov
- Data.gov
- Dataverse Network
- Infochimps
- Linked Data
- Guardian DataBlog
- Data Market
- Reddit Open Data
- Climate Data Sources
- Climate Station Records
- CDC Data
- World Bank Catalog
- Free SVG Maps
- Office for National Statistics
- StateMaster
- Open data catalogs from various governments and NGOs:
  - NYC Open Data
  - DC Open Data Catalog / OpenDataDC
  - DataLA
  - data.gov (see also: Project Open Data Dashboard)
  - data.gov.uk
  - US Census Bureau
  - World Bank Open Data
  - Humanitarian Data Exchange
  - Sunlight Foundation: government-focused data
  - ProPublica Data Store
- Datasets hosted by academic institutions:
  - UC Irvine Machine Learning Repository: datasets specifically designed for machine learning
  - Stanford Large Network Dataset Collection: graph data
  - Inter-university Consortium for Political and Social Research
  - Pittsburgh Science of Learning Center's DataShop
  - Academic Torrents: distributed network for sharing large research datasets
- Datasets hosted by private companies:
  - Quandl: over 10 million financial, economic, and social datasets
  - Amazon Web Services Public Data Sets
  - Kaggle provides datasets with their challenges, but each competition has its own rules as to whether the data can be used outside of the scope of the competition.
- Big lists of datasets:
  - Rdatasets: collection of 700+ datasets originally distributed with R packages
  - RDataMining.com
  - KDnuggets
  - inside-R
  - 100+ Interesting Data Sets for Statistics
  - 20 Free Big Data Sources
- APIs:
  - Apigee: explore dozens of popular APIs
  - Python APIs: Python wrappers for many APIs
- Other interesting datasets:
  - FiveThirtyEight: data and code related to their articles
  - Donors Choose: data related to their projects
  - 200,000+ Jeopardy questions
- Other resources:
  - Datasets subreddit: ask for help finding a specific data set, or post your own
  - Center for Data Innovation: blog posts about interesting, recently-released data sets
This is just the tip of the iceberg; there's a lot of data out there!