STA 380: Predictive Modeling
Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.
Office hours
On Friday, July 29th, I will hold office hours from 10am to 12pm (normal class time). I will start in my office (CBA 6.478), but if a lot of folks show up at once, we'll move to the regular classroom.
On Tuesday (8/2) and Thursday (8/4), I will hold office hours from 9am to 10am in CBA 6.478.
Scribe notes and exercises
To submit your scribe report, please e-mail me (james.scott at mccombs.utexas.edu) a link to a .pdf or .md file on your own GitHub page. Do not send an attachment.
You can find the up-to-date collection of scribe notes here.
The first set of exercises is available here.
Topics
(0) The data scientist's toolbox
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.
Readings:
- a few introductory slides
- Jeff Leek's guide to sharing data
- Introduction to RMarkdown
- Introduction to GitHub
(1) Foundations of probability
Basic probability, and some fun examples. Random variables, probability distributions, expected value. Joint, marginal, and conditional probability. Independence. Law of total probability. Bayes' rule.
Readings:
- excerpts from an in-progress book on probability.
Some optional stuff:
- Bayes and the search for Air France 447.
- YouTube video on Bayes and the USS Scorpion.
- Pretty-but-wrong visualization by the New York Times on the long-term failure rates of various contraceptive methods, together with James Trussell's explanation of why the 10-year numbers are wrong. His quote is about halfway down the page. A great example where assuming independence can lead to trouble!
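To see how an independence assumption can inflate a long-run number, here is a minimal sketch in Python (the course's own walkthroughs are in R). It compounds an average annual failure rate over 10 years as if every year were an independent draw for every user, and compares that with a population made up of two groups with different annual rates. The group shares and rates are made-up illustrative values, not figures from the New York Times piece or from Trussell, and the two-group setup is only one simplified way that independence can fail.

```python
def naive_failure_prob(annual_rate, years):
    """P(at least one failure in `years`), treating years as independent
    draws at a single shared annual rate."""
    return 1.0 - (1.0 - annual_rate) ** years


def mixture_failure_prob(groups, years):
    """Same quantity for a population of groups, each with its own rate.

    `groups` is a list of (population_share, annual_rate) pairs.
    """
    return sum(share * naive_failure_prob(rate, years) for share, rate in groups)


# Hypothetical population: 80% careful users, 20% failure-prone users.
groups = [(0.8, 0.02), (0.2, 0.30)]

# The first-year failure rate observed across the whole population.
average_rate = sum(share * rate for share, rate in groups)

naive = naive_failure_prob(average_rate, 10)
mixed = mixture_failure_prob(groups, 10)

print(f"average annual rate:        {average_rate:.3f}")
print(f"10-year, naive compounding: {naive:.3f}")
print(f"10-year, two-group mixture: {mixed:.3f}")
```

In this setup the naive figure is always at least as large as the mixture figure, because the failure-prone group accounts for most of the early failures and the people who remain are, on average, lower risk.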
(2) Exploratory analysis
Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation).
Scripts and data:
Readings:
- excerpts from my course notes on statistical modeling
- NIST Handbook, Chapter 1.
- R walkthroughs on basic EDA: contingency tables, histograms, and scatterplots/lattice plots.
- Bad graphics
- Good graphics: scan through some of the New York Times' best data visualizations
(3) Resampling methods
The bootstrap and the permutation test; joint distributions; using the bootstrap to approximate value at risk (VaR).
Scripts:
Readings:
- ISL Section 5.2 for a basic overview.
- These notes on bootstrapping and the permutation test.
- Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
- This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
- Another R walkthrough on the permutation test in a simple 2x2 table.
- Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.
Optionally, Shalizi (Chapter 6) has a much lengthier treatment of the bootstrap, should you wish to consult it.
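As a rough illustration of the bootstrap idea in the readings above, the sketch below estimates the standard error of a sample mean by resampling the data with replacement. It is written in Python using only the standard library, whereas the course walkthroughs use R. The function names, the simulated data, and the choice of 2000 resamples are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_means(data, n_boot=2000, seed=0):
    """Return the sample mean of each of `n_boot` bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n observations from the data, with replacement.
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)]


def bootstrap_se(data, n_boot=2000, seed=0):
    """Bootstrap standard error: the spread of the resampled means."""
    return statistics.stdev(bootstrap_means(data, n_boot, seed))


# Simulated data standing in for a real sample (hypothetical values).
data_rng = random.Random(1)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

se_boot = bootstrap_se(sample)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5

print(f"bootstrap SE of the mean: {se_boot:.4f}")
print(f"textbook s / sqrt(n):     {se_formula:.4f}")
```

For a sample mean the two numbers should be close, which makes this a handy sanity check. The bootstrap earns its keep for statistics with no simple standard-error formula, such as a portfolio's value at risk.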
(4) Clustering
Basics of clustering; K-means clustering; hierarchical clustering.
Scripts and data:
Readings:
- ISL Sections 10.1 and 10.3
- Elements Chapter 14.3 (more advanced)
- K-means examples: a few stylized examples to build your intuition for how k-means behaves.
- Hierarchical clustering examples: ditto for hierarchical clustering.
- K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
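The K-means++ seeding step can be sketched in a few lines. The version below is a minimal Python illustration (not one of the course's R scripts): it picks the first center uniformly at random, then picks each later center with probability proportional to its squared distance from the nearest center chosen so far. The function names and the toy data are hypothetical, and the sketch assumes the data contain at least `k` distinct points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial cluster centers using K-means++ style seeding."""
    rng = random.Random(seed)
    # First center: a uniformly random data point.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest center
        # chosen so far; points that are already centers get weight 0.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Toy data: three tight, well-separated groups of four points each.
bases = [(0, 0), (100, 100), (200, 0)]
points = [(bx + dx, by + dy) for bx, by in bases for dx in (0, 1) for dy in (0, 1)]

centers = kmeans_pp_init(points, k=3)
print(centers)
```

Because far-away points get much larger weights, the initial centers tend to spread across the data, whereas purely random initialization can easily place two centers in the same cluster.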
(5) Latent features and structure
Principal component analysis (PCA). If time: canonical correlation analysis; multi-dimensional scaling.
Scripts and data:
- pca_2D.R
- pca_intro.R
- congress109.R, congress109.csv, and congress109members.csv
- gasoline.R and gasoline.csv
- FXmonthly.R, FXmonthly.csv, and currency_codes.txt
- cca_intro.R, mmreg.csv, and mouse_nutrition.csv
Readings:
- ISL Section 10.2 for the basics
- Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 covers factor analysis in much more depth than we did in class.
- Elements Chapter 14.5 (more advanced)
(6) Text data
Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
Scripts and data:
- textutils.R
- nyt_stories.R and selections from the New York Times.
- tm_examples.R and selections from the Reuters newswire.
- naive_bayes.R
- simple_mixture.R
- congress109_topics.R
Readings:
- Stanford NLP notes on vector-space models of text, TF-IDF weighting, and so forth.
- [Using the tm package](http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) for text mining in R.
- Dave Blei's survey of topic models.
- A pretty long blog post on naive-Bayes classification.
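For a concrete feel for TF-IDF weighting, here is a small Python sketch (the course scripts themselves are in R and use the tm package). It uses one common variant: term frequency is a term's count divided by the document length, and inverse document frequency is the log of the number of documents divided by the number of documents containing the term. Other variants exist, and the whitespace tokenizer and the three example documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length)
             * log(number of docs / number of docs containing term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus.
docs = [
    "the senate passed the budget",
    "the court heard the appeal",
    "the budget vote",
]

weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    ranked = sorted(w.items(), key=lambda item: item[1], reverse=True)
    print(doc, "->", ranked[:2])
```

A word that appears in every document, like "the" here, gets weight zero, while words concentrated in a single document get the largest weights.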
(7) Miscellaneous
Coverage of these topics will depend on the time available. Possibilities include: anomaly detection; label propagation; learning association rules; graph partitioning; partial least squares.
Scripts and data:
Readings: