This short 15-hour course presents the fundamental philosophy behind literate programming and shows how to apply it to conduct faithful, reproducible data analyses using sound statistical procedures and modern data analytics tools. The course uses RStudio as its IDE and the R programming language for data analysis. Every lecture is backed by practical sessions and worked examples.
- (First Session) General Introduction
- (First Session) Literate Programming - Motivation & RStudio Case Study
- Reproducibility and Literate Programming (PDF)
- Why R? RStudio. (PDF)
- Hands-on: Using RStudio for running a Statistical Analysis
- Given Example Analysis
- Data set #1: ping-pong measurements
- Data set #2: iteration duration of a geophysics application
- (Second Session) Data Carpentry and Manipulation - Cleaning up data and using the dplyr R package
- Introduction & Data Characterization (PDF)
- Tidy Data (PDF)
- Data Manipulation Workflow (groups, summarize) (PDF) + Example
- Hands-on: Given names in France - 2016 Edition
- Mid-term activity (Deadline: Saturday 28/10 at 23:59)
- (Third Session) Data Quality, Descriptive Statistics
- Discussion about the Porto Alegre (POA) accidents
- Data quality (criteria) and missing data
- Descriptive Statistics: central tendency, variability
- Critical Analysis of a Plot (Homework): Choose a plot that has
been published on the Internet, on a news site, or anywhere else.
Then, in an Rmd file, provide a critical analysis of it. Put it in
your Git repository and send us the link by e-mail, or send us the
Rmd file directly by e-mail.
N | Solutions |
---|---|
1 | Rodrigo F. |
2 | Lizeth |
3 | Emmanuell |
4 | Lucas |
5 | Liza |
6 | Gabrielli |
7 | Matheus |
8 | Felipe |
9 | Rodrigo |
10 | Eduardo |
- (Fourth Session) Data Visualization
- Discussion about last homework (critical plot analysis)
- Checklist for good graphics (Table)
- Data Viz with the ggplot2 package
- Probabilistic Modeling
- Law of large numbers, Central Limit Theorem (CLT)
- Presentation of Scientific Results
- Improve the POA Accidents Dataset plots (Homework): Go back to
the mid-term activity and improve your plots, or create new ones,
using the checklist for good graphics presented today.
N | Solutions |
---|---|
1 | Eduardo (PDF) |
2 | Lucas |
3 | Felipe |
4 | Rodrigo S. (Loteria) |
5 | Emmanuell (PDF) |
- (Fifth Session) Statistics
- Duke of Tuscany Problem
- Model: The sum of 3 dice is modeled by a random variable with values in {3, …, 18} and probabilities P_3, …, P_18.
- Question: Is P_9 < P_10?
- Method: Estimate P_9 and P_10 from samples and fix a confidence level. Decide P_9 < P_10 if … (under which condition?)
- Analysis of the Duke of Tuscany Problem
- Stick Breaking Problem
- Estimators - how to get information from samples
- Hands-on: Estimation Example
- Duke of Tuscany Problem
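
The Duke of Tuscany question above can be explored with a short simulation. The sketch below is only an illustration, written in Python rather than the R used in class. It first enumerates the 216 outcomes of three fair dice to obtain the exact P_9 and P_10, then estimates both from simulated rolls with a normal-approximation confidence interval. The function names, the sample size, the seed, the 95% level (z = 1.96), and the "non-overlapping intervals" decision rule are all assumptions made here for illustration; they are one possible answer to the "under which condition?" question, not necessarily the one developed in the session.

```python
import itertools
import math
import random
from fractions import Fraction


def exact_sum_probabilities(n_dice=3, sides=6):
    """Enumerate every outcome of fair dice and return {sum: exact probability}."""
    counts = {}
    for roll in itertools.product(range(1, sides + 1), repeat=n_dice):
        total = sum(roll)
        counts[total] = counts.get(total, 0) + 1
    n_outcomes = sides ** n_dice
    return {s: Fraction(c, n_outcomes) for s, c in counts.items()}


def estimate_with_interval(hits, n_trials, z=1.96):
    """Return (estimate, lower, upper) using a normal-approximation interval.

    z = 1.96 corresponds to roughly a 95% confidence level (an assumed choice).
    """
    p_hat = hits / n_trials
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_trials)
    return p_hat, p_hat - half_width, p_hat + half_width


# Exact reference values: 25/216 for a sum of 9, 27/216 = 1/8 for a sum of 10.
exact = exact_sum_probabilities()
print("exact P_9 =", exact[9], "| exact P_10 =", exact[10])

# Simulated experiment: roll three dice n_trials times (seed fixed for repeatability).
rng = random.Random(2017)
n_trials = 200_000
sums = [rng.randint(1, 6) + rng.randint(1, 6) + rng.randint(1, 6)
        for _ in range(n_trials)]

p9, low9, high9 = estimate_with_interval(sums.count(9), n_trials)
p10, low10, high10 = estimate_with_interval(sums.count(10), n_trials)

# Assumed decision rule: conclude P_9 < P_10 only if the two intervals do not
# overlap, i.e. the upper bound for P_9 lies below the lower bound for P_10.
decide_p9_less_than_p10 = high9 < low10
print(f"P_9  ~ {p9:.4f}  [{low9:.4f}, {high9:.4f}]")
print(f"P_10 ~ {p10:.4f}  [{low10:.4f}, {high10:.4f}]")
print("decide P_9 < P_10:", decide_p9_less_than_p10)
```

Because the two probabilities differ by only 2/216, the number of rolls matters: with a few hundred rolls the intervals typically overlap and no decision can be made, which is worth checking by lowering `n_trials`.
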
Day | Date | Hour | Room |
---|---|---|---|
0 | 24/10 (Tuesday) | 8:30 – 10:30 (2h) | Lab 67-104 |
1 | 25/10 (Wednesday) | 8:30 – 10:30 (2h) | Lab 67-104 |
2 | 30/10 (Monday) | 8:30 – 10:30 (2h) | Lab 67-104 |
3 | 31/10 (Tuesday) | 8:30 – 12:30 (4h) | Lab 67-103 |
4 | 01/11 (Wednesday) | 8:30 – 12:30 (4h) | Lab 67-103 / AUD-1 |
The deadline for the final project is the 15th of December, 2017.
Student | Dataset | Status |
---|---|---|
Eduardo | Boston Marathon 2017 | ok |
Liza | US Homicides | ok |
Fábio | Porto Alegre accidents | ok |
Gabrielli | Rainfall in India | ok |
Felipe | Online Retail Sales in Europe | ok |
Rodrigo F. | US Homicides | ok |
Lucas | Professional Hockey | ok |
Matheus | RS Homicide | ok |
Rodrigo N. | Video Game Sales | ok |
Lizeth | World Happiness | ok |
Emmanuell | Land usage and Agriculture & Climate change | ok |
- Literate Programming. Donald E. Knuth. CSLI Lecture Notes, no. 27. Stanford, California. ISBN 0-937073-80-6.
- Applied Statistics and Probability for Engineers, 6th Edition. Douglas C. Montgomery, George C. Runger. Wiley.
- R for Data Science. Garrett Grolemund, Hadley Wickham. http://r4ds.had.co.nz/