JSC370: Data Science II (Winter 2024), University of Toronto

Where and When

Instructor: Meredith Franklin
- Email: meredith.franklin@utoronto.ca, please put "JSC370" in the subject line.
Teaching Assistant: Jun Ni (Jenny) Du, junni.du@mail.utoronto.ca
Location: ESB 142
Time: Mondays and Wednesdays, 1-3pm
Office hours: By Appointment
Course Forum: Quercus
Course syllabus
Lab materials

Weekly Course Schedule

Week and Dates	Topics/Weekly Activities	Due Dates (11:59 pm Fridays unless noted)
Week 1 January 8 lecture pdf January 10 lab	Introduction to Data Science tools: R, markdown	Lab 1
Week 2 January 15 lecture pdf January 17 lab	Version Control & Reproducible Research, Git	Lab 2
Week 3 January 22 lecture pdf January 24 lab (sample solution)	Exploratory Data Analysis	Lab 3
Week 4 January 29 lecture pdf January 31 lab (sample solution)	Data visualization	HW1, Lab 4
Week 5 February 5 lecture pdf February 7 lab (sample solution)	Data cleaning and wrangling; ML 1: advanced regression (advanced regression solution)	Lab 5
Week 6 February 12 lecture pdf February 14 lab (sample solution)	Regular Expressions, Data scraping, using APIs	HW2, Lab 6
Week 7 February 21	Reading Week
Week 8 February 26 lecture February 28 lab (sample solution)	Text mining	Lab 8
Week 9 March 4 lecture March 6 lab (sample solution)	High performance computing, cloud computing	Midterm, Lab 9
Week 10 March 11 lecture March 13 lab (sample solution), lab-b (optional) (sample solution)	ML 2 (trees, rf, xgboost)	Lab 10
Week 11 March 18 lecture March 20 lab11 (sample solution)	Interactive visualization and effective data communication I	HW3, Lab 11
Week 12 March 25 lecture March 27 lab12	Interactive visualization and effective data communication II	Lab 12
Week 13 April 1 lecture April 3	Final Project Workshop	HW4
Week 15 April 30		Final Project, HW5

Grading Breakdown

Task	% of Grade
Labs (including attendance)	10
Homework (5)	25
Midterm report	30
Final project	35

Resources

Markdown

The Plain Person’s Guide to Plain Text Social Science: Why you should write data-based reports using plain-text tools.
Markdown tutorial: An interactive tutorial to practice using Markdown.
Markdown cheatsheet: Useful one-page reminder of Markdown syntax.

Helpers and Templates

RMarkdown Cheatsheet: An overview of Markdown and RMarkdown conventions.
RStudio Cheatsheets: Other quick guides, including a more comprehensive RMarkdown reference and information about using RStudio's IDE and some of the main tools in R.

Guides

R Style Guide. Write readable code.
Jenny Bryan's Stat 545. Notes and tutorials for a Data Analysis course taught by Jennifer Bryan at the University of British Columbia. Lots of useful material.
knitr demos. Documentation and examples for knitr by its author, Yihui Xie. There is also a knitr book covering the same ground in more detail.
Rmarkdown documentation from the makers of RStudio. Lots of good examples.
Plain Person's Guide. The git repository for this project.
Karl Broman's Tutorials and Guides. Accurate and concise guides to many of the tools and topics described here, including getting started with reproducible research, using git and GitHub, and working with knitr.
Makefiles for OCR and converting Shapefiles. Some further examples of Makefiles in the data-analysis pipeline, by Lincoln Mullen.

Tools

Apple's Developer Tools Unix toolchain. Install directly with xcode-select --install, or just try to use e.g. git from the terminal and have OS X prompt you to install the tools.
Homebrew package manager. A convenient way to install several of the tools here, including Emacs and Pandoc.
R. A platform for statistical computing.
knitr. Reproducible plain-text documents from within R.
Python and SciPy. Python is a general-purpose programming language increasingly used in data manipulation and analysis.
RStudio. An IDE for R. The most straightforward way to get into using R and RMarkdown.
TeX and LaTeX. A typesetting and document preparation system. You can write files in .tex format directly, but it is more useful to just have it available in the background for other tools to use. The MacTeX Distribution is the one to install for macOS.
Pandoc. Converts plain-text documents to and from a wide variety of formats. Can be installed with Homebrew. For processing citations and bibliographies, Pandoc 2.11 and later has citation support built in (the --citeproc option); older versions need the separate pandoc-citeproc filter. Also install pandoc-crossref for producing cross-references and labels.
Git. Version control system. Installs with Apple's Developer Tools, or get the latest version via Homebrew.
GNU Make. You tell make what the steps are to create the pieces of a document or program. As you edit and change the various pieces, it automatically figures out which pieces need to be updated and recompiled, and issues the commands to do that. See Karl Broman's Minimal Make for a short introduction. Make will be installed automatically with Apple's developer tools.
lintr and flycheck. Tools that nudge you to write neater code.
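
The rebuild rule described in the GNU Make entry above can be sketched in a few lines of Python. This is only an illustration of the idea, not how make is implemented: a target needs rebuilding when it does not exist or when any of its dependencies was modified more recently. The function name `needs_rebuild` and the file names in the usage lines are hypothetical, and real make also brings each dependency up to date first, which this sketch leaves out.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    Simplified version of make's timestamp check; it does not
    recursively rebuild the dependencies themselves.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A dependency modified after the target makes the target stale.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    if os.path.exists("report.Rmd") and needs_rebuild("report.html", ["report.Rmd"]):
        print("report.html is out of date")
```

In a Makefile the same relationship is declared rather than computed by hand: you list the target, its dependencies, and the command that rebuilds it, and make performs this comparison for every target.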

Other Applications and Services

Backblaze. Secure off-site backup.
GitHub. Hosts public and private Git repositories for free, with paid plans for additional features. Also a source for publicly available code (e.g. R packages and utilities) written by other people.
Marked 2. Live HTML previewing of Markdown documents. Mac OS X only.
Sublime Text. Python-based text editor.
Zotero, Mendeley, and Papers are citation managers that incorporate PDF storage, annotation and other features. Zotero is free to use. Mendeley has a premium tier. Papers is a paid application after a trial period. I don't use these tools much, but that's not for any strong principled reason, mostly just inertia. If you use one and want to integrate with the material here, just make sure it can export to BibTeX/BibLaTeX files. Papers, which I've used most recently, can handily output citation keys in pandoc's format amongst several others.

Data

Many of these websites offer an API for downloading their data. We recommend using the APIs to get your data.
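
As a rough illustration of getting data through an API rather than downloading files by hand, here is a minimal Python sketch using only the standard library. The endpoint `https://example.org/api/records` and the parameter names are placeholders, not a real service; each site listed below documents its own URLs, parameters, and any API key requirements. The helper names `build_url` and `fetch_json` are likewise hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base_url, **params):
    """Append URL-encoded query parameters to a base URL.

    Assumes `base_url` does not already contain a query string.
    """
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder endpoint and parameters; replace with values from
    # the documentation of the API you are actually using.
    url = build_url("https://example.org/api/records", limit=5, format="json")
    print(url)
    # records = fetch_json(url)  # uncomment once `url` points at a real API
```

Keeping the request in code records exactly which query produced the data, so the download step can be rerun along with the rest of the analysis.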

Health and Biological data

NIH Cancer Surveillance
World Health Organization WHO data
UniProt data
The Gene Ontology Project
US Centers for Disease Control and Prevention Data
California Health and Human Services Open Data Portal
Covid Data CovidTracker

Academic Publications and related

Figshare data repository
Zenodo data repository
Harvard Dataverse
Elsevier Developers API

Other data

Toronto open data
Toronto Police Department
British Columbia open data
Ontario Data Catalogue
Los Angeles city data
Los Angeles crime data
Google Earth Engine
Google Dataset Search
FiveThirtyEight open data
World Bank open data
US Open Data Initiative DATA.GOV
US Census Data National Historical Geographic Information System (NHGIS)
Canada Census Data

Social Networks

Twitter Developers API
GitHub Developers API
Instagram Developers API
LinkedIn Developers API
Zillow Developers API
Spotify Developers API
