DSCI 522: Data Science Workflows

Introduction

Data-rich projects can quickly grow out of hand and become irreproducible without deliberate effort at organization, tool choice, and process. This course teaches the basic principles of sound data science workflows and develops skills in implementing them with appropriate state-of-the-art systems and languages (e.g., Python and R).

Course webpage: https://github.ubc.ca/MDS-2019-20/DSCI_522_dsci-workflows_students

Slack Channel: https://ubc-mds.slack.com/messages/522_dsci-workflows

Learning Outcomes

By the end of the course, students are expected to be able to:

  1. Map a data analysis question to an appropriate analysis.
  2. Write R, Python, and shell scripts for non-interactive data analysis.
  3. Run literate coding documents (Jupyter notebooks and R Markdown documents) non-interactively.
  4. Use a Git/GitHub fork-and-pull-request workflow to collaborate on a data analysis project.
  5. Automate data science workflows (e.g., using Make).
  6. Manage project software and environment dependencies (e.g., using Docker).

Teaching Team

| Position | Name | Slack Handle | GHE Handle | Office Hours |
|---|---|---|---|---|
| Lecture Instructor | Tiffany Timbers | @tiffany | @timberst | Thursday, 12:45 - 13:45 (location posted in the calendar) |
| Lab Instructor | Firas Moosvi | @Firas | @Firasm | NA |
| Teaching Assistant | Javier Castillo-Arnemann | @Javier | NA | Posted on Calendar |
| Teaching Assistant | Ozum Kafaee | @Ozum | NA | Posted on Calendar |
| Teaching Assistant | Gary Zhu | @Gary | NA | Posted on Calendar |
| Teaching Assistant | Kate Sedivy-Haley | @Kate | NA | Posted on Calendar |

Note: Attendance at office hours is optional.

Project Deadlines

This is a project-based course. You will work in randomly assigned groups of three (or four, if needed). You'll be evaluated as follows:

| Assessment | Weight | Deadline | Location |
|---|---|---|---|
| Milestone 1 - Proposal and data download script | 10% | 2020-01-18 @ 18:00 | Submit to GitHub |
| Milestone 2 - Working analysis scripts and report draft | 20% | 2020-01-25 @ 18:00 | Submit to GitHub |
| Milestone 3 - Data analysis pipeline with Make | 20% | 2020-02-01 @ 18:00 | Submit to GitHub |
| Milestone 4 - Final project submission (with ultimate reproducibility) | 30% | 2020-02-08 @ 18:00 | Submit to GitHub |
| Team work | 20% | 2020-02-11 @ 18:00 | Submit to GitHub |

Lab Details:

| Lab | Topic |
|---|---|
| 1 | Teamwork activity, tagged releases, semantic versioning |
| 2 | TBD |
| 3 | TBD |
| 4 | TBD |

Schedule

| Lecture | Topic | Required Readings | Additional Readings |
|---|---|---|---|
| 1 | Introduction to Data Science Workflows | | |
| 2 | Scaling up: read-eval-print-loop (REPL) processes versus non-interactive scripts | | |
| 3 | Scaling up cont'd: using literate coding documents (Jupyter notebooks and R Markdown documents) non-interactively | | |
| 4 | Data analysis pipelines and shell scripting | | |
| 5 | Automated workflows; introduction to the build/automation tool Make | | |
| 6 | Environment management: containerization with Docker part I | | |
| 7 | Environment management: containerization with Docker part II | | |
| 8 | Environment management: containerization with Docker part III & Reproducibility wrap-up | | |

Textbooks:

  • Art of Data Science by Roger Peng & Elizabeth Matsui (very cheap or even free!)
    • Note: there are two packages. You only need the textbook ("The Book" package); you do not need the lecture videos.

Related fun?

  • Not So Standard Deviations: a podcast co-hosted by Roger Peng of the Johns Hopkins Bloomberg School of Public Health and Hilary Parker of Stitch Fix.
  • Simply Statistics: a statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek.

Policies

Please see the general MDS policies.

UBC provides resources to support student learning and to maintain healthy lifestyles but recognizes that sometimes crises arise and so there are additional resources to access including those for survivors of sexual violence. UBC values respect for the person and ideas of all members of the academic community. Harassment and discrimination are not tolerated nor is suppression of academic freedom. UBC provides appropriate accommodation for students with disabilities and for religious and cultural observances. UBC values academic honesty and students are expected to acknowledge the ideas generated by others and to uphold the highest academic standards in all of their actions. Details of the policies and how to access support are available here.