STA 380: Predictive Modeling

Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Topics

The readings listed below are not yet complete, but the topics list is accurate.

(1) The data scientist's toolbox

Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

Readings: TBA

(2) Exploratory analysis

Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation).

Scripts and data:

gdpgrowth.R and gdpgrowth.csv
titanic.R and TitanicSurvival

Readings:

NIST Handbook, Chapter 1.
R walkthroughs on basic EDA: contingency tables, histograms, and scatterplots/lattice plots.
Bad graphics
Good graphics: scan through some of the New York Times' best data visualizations
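
As a small illustration of two of the measures of association listed for this topic, the sketch below computes relative risk and the odds ratio from a 2x2 contingency table. It is written in Python rather than the R used in the course scripts, the function names are hypothetical, and the counts are made up for illustration; they do not come from the course data sets.

```python
def relative_risk(a, b, c, d):
    """Relative risk for a 2x2 table.

    Rows are groups, columns are outcomes:
        group 1: a events, b non-events
        group 2: c events, d non-events
    """
    risk1 = a / (a + b)  # proportion of events in group 1
    risk2 = c / (c + d)  # proportion of events in group 2
    return risk1 / risk2


def odds_ratio(a, b, c, d):
    """Odds ratio for the same layout: (a / b) / (c / d)."""
    return (a * d) / (b * c)


# Made-up counts, for illustration only.
a, b = 30, 70  # group 1: 30 events out of 100
c, d = 10, 90  # group 2: 10 events out of 100

rr = relative_risk(a, b, c, d)
or_ = odds_ratio(a, b, c, d)
print(f"relative risk = {rr:.2f}")
print(f"odds ratio    = {or_:.2f}")
```

The two measures agree closely when the event is rare in both groups and diverge as the event becomes more common, which is one reason to report which one is being used.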

(3) Resampling methods

The bootstrap and the permutation test; using the bootstrap to approximate value at risk (VaR).

Scripts:

gonefishing.R and gonefishing.csv

Readings:

ISL Section 5.2 for a basic overview.
These notes, pages 99-111. This is an introduction to the bootstrap from the (by now familiar) perspective of linear regression modeling, but it conveys the essential idea.
This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
Another R walkthrough on the permutation test in a simple 2x2 table.
Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.

Optionally, Shalizi (Chapter 6) has a much lengthier treatment of the bootstrap, should you wish to consult it.
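
The two uses of the bootstrap named in this topic can be sketched in a few lines. The example below is a minimal illustration in Python (not the R of the course walkthroughs) and uses only synthetic data. It first bootstraps the standard error of a sample mean, then approximates a 5% value at risk by resampling daily returns over a holding period. The function names, the 20-day horizon, the initial wealth, and the return distribution are all assumptions made for the example.

```python
import random
import statistics


def bootstrap_means(sample, n_boot=2000, seed=0):
    """Return n_boot sample means, each computed on a resample drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]


def bootstrap_var(daily_returns, wealth=10000.0, horizon=20,
                  level=0.05, n_boot=2000, seed=0):
    """Approximate value at risk by resampling daily returns with replacement.

    Each bootstrap replicate compounds `horizon` resampled daily returns and
    records the profit or loss. VaR is reported as a positive loss: the
    negative of the `level` quantile of the simulated profit and loss.
    """
    rng = random.Random(seed)
    pnl = []
    for _ in range(n_boot):
        w = wealth
        for r in rng.choices(daily_returns, k=horizon):
            w *= 1.0 + r
        pnl.append(w - wealth)
    pnl.sort()
    return -pnl[int(level * n_boot)]


# Synthetic data, for illustration only.
data_rng = random.Random(1)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]
returns = [data_rng.gauss(0.0005, 0.01) for _ in range(250)]

# Bootstrap standard error: the standard deviation of the resampled means.
boot = bootstrap_means(sample)
boot_se = statistics.stdev(boot)

# Textbook formula for comparison: s / sqrt(n).
formula_se = statistics.stdev(sample) / len(sample) ** 0.5

var_5pct = bootstrap_var(returns)

print(f"bootstrap SE of the mean: {boot_se:.4f}")
print(f"formula SE of the mean:   {formula_se:.4f}")
print(f"20-day 5% VaR on $10,000: {var_5pct:.2f}")
```

For the sample mean the bootstrap and the formula should give similar answers. The bootstrap earns its keep on quantities such as VaR, where no simple formula is available. Resampling individual days independently, as done here, ignores any dependence between days.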

(4) Latent classes

K-means clustering; mixture models; hierarchical clustering.

Readings:

ISL Sections 10.1 and 10.3
Elements Section 14.3 (more advanced)
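
For orientation before the readings, here is a minimal sketch of K-means using Lloyd's algorithm: alternate between assigning each point to its nearest center and moving each center to the mean of its assigned points. It is written in Python rather than R. Using the first k points as initial centers is a simplification chosen to keep the example deterministic; practical implementations usually use several random starts. The function names and the toy points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (centers, labels)."""
    # Simplifying assumption: initialize with the first k points.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Toy data, for illustration only: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centers, labels = kmeans(points, k=2)
print("labels: ", labels)
print("centers:", centers)
```

K-means only finds a local optimum, so the result can depend on the starting centers.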

(5) Latent features and structure

Principal component analysis (PCA); factor analysis; canonical correlation analysis; multi-dimensional scaling.

Readings:

ISL Section 10.2 for the basics
Shalizi Chapters 18 and 19 (more advanced)
Elements Section 14.5 (more advanced)

(6) Text data

Co-occurrence statistics; TF-IDF; topic models; vector-space models of text (if time allows).

Readings: TBA
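
Until readings are posted, the following sketch may help fix ideas about TF-IDF. It is in Python rather than R, and it uses one common variant among several: term frequency is the term's count divided by the document length, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """TF-IDF weights for a list of tokenized, non-empty documents.

    tf  = count of term in document / document length
    idf = log(N / number of documents containing the term)
    Returns one {term: weight} dictionary per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
docs = [text.split() for text in corpus]
weights = tf_idf(docs)

for text, w in zip(corpus, weights):
    top_term = max(w, key=w.get)
    print(f"{text!r}: highest-weighted term is {top_term!r}")
```

A word that appears in every document, such as "the" here, gets weight zero under this variant, while words concentrated in a single document get the largest weights.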

(7) Miscellaneous

Coverage of these topics will depend on the time available. Possibilities include: anomaly detection; label propagation; learning association rules; graph partitioning; partial least squares.

Readings: TBA
