/stat418-tools-in-datascience

UCLA MAS STAT 418


Stats 418: Tools in Data Science

Stats 418 is a graduate-level statistics course restricted to UCLA Masters in Applied Statistics students. The course presents current tools used by data scientists in practice for data acquisition, transformation, and analysis; data visualization; machine learning; and reproducible data analysis, collaboration, and model deployment. Topics include advanced R packages and Python libraries, analytical databases, high-performance machine learning libraries, and big data tools.

Course Description

Data Science has been vaguely defined and redefined for the better part of the past decade, but it has a long history in both statistics and computing. What is clear is that the necessary 'tools' for a Data Scientist are changing at an incredibly rapid rate, making a course focused exclusively on these tools a necessity. The prerequisites of this course, Data Management and Statistical Computing, together with the other courses in the MAS program, provide a solid base for a large portion of daily Data Science work, and this course builds on that foundation. The course will present a broad range of topics, all under the overarching goal of building a data product (something akin to taking a data science project from conception to completion through deployment). These topics mirror the tools currently used in industry by top Data Scientists, from unicorn startups to Fortune 500 companies. Much of the material will be applied in a learning-by-doing fashion, with resources provided for diving deeper when a topic calls for it.

Key Dates

Problem set due dates will be announced as each problem set is distributed.

Other important deadlines and dates during the term are:

  • 5/7/19 data/proposal slide submissions (push to GitHub)
  • Student data/proposal presentations (Week 6)
  • 6/4/19 Final slide submissions
  • Student final presentations (Week 10)

Resources

Papers will be posted in the corresponding weekly directory. There is no textbook.

Evaluation

Problem Sets

Working on actual problems is central to learning. Four problem sets will be assigned, on alternating weeks. These assignments will consist of analytical problems, computer simulations, and data analysis. Late submissions will not be accepted. Assignments will generally be made available on Tuesdays and are due two Tuesdays later, prior to lecture. All sufficiently attempted homework (i.e., a typed and well-organized write-up with all problems attempted) will be graded on a (+, ✓, -) scale. Students are encouraged to discuss the problems together, but must independently produce and submit solutions. Work should be done as an R Markdown file or a Jupyter notebook and committed to GitHub.

Final Project

A final project will be completed individually. The project will encourage collaboration, test your data acquisition skills, draw on your predictive modeling, challenge your programming ability, and promote presentation skills. A proposal presentation covering an acquired dataset, exploratory data analysis, and future direction will be given during week 6. In addition, each student will present their work to the class during the final week of the quarter. Effective verbal communication is a critical skill for data scientists, and it requires practice and feedback to develop. Additional information about the final project, including the grading rubric, will be given as the course progresses.

Course Topics

Week 1 [4/2]: Introduction to course and each other. Overview of Data Science tools. Introduction and installation of Docker.

Week 2 [4/9]: Data Science in the command line. Learning about Unix. Reproducible research/work through Git/GitHub and Docker.

Week 3 [4/16]: More Data Science in the command line. Analytical databases, SQL, NoSQL databases, MongoDB. Accessing databases through R (dbplyr) and Python (SQLAlchemy).

Week 4 [4/23]: Acquiring data through APIs and web scraping. rvest in R and Beautiful Soup in Python.

Week 5 [4/30]: Tools for data visualization: ggplot2, shiny (interactive web applications with R) / shiny dashboards and plotly.

Week 6 [5/7]: Final Project proposal presentations. Machine Learning libraries in both R and Python. Introduction to deep learning and AutoML.

Week 7 [5/14]: Continuation of Machine Learning libraries with use of cloud services. Introduction to NLP libraries.

Week 8 [5/21]: Building APIs for model deployment. Exploration of Plumber (R) and Flask or Falcon (Python).

Week 9 [5/28]: Continuation of API construction for model deployment. Building a Slackbot.

Week 10 [6/4]: Student Final Presentations.