STA 380: Predictive Modeling

Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Office hours

On Friday, July 29th, I will hold office hours from 10am to 12pm (normal class time). I will start in my office (CBA 6.478), but if a lot of folks show up at once, we'll move to the regular classroom.

On Tuesday (8/2) and Thursday (8/4), I will hold office hours from 9-10 AM in CBA 6.478.

Scribe notes and exercises

To submit your scribe report, please e-mail me (james.scott at mccombs.utexas.edu) a link to a .pdf or .md file on your own GitHub page. Do not send an attachment.

You can find the up-to-date collection of scribe notes here.

The first set of exercises is available here.

Topics

(0) The data scientist's toolbox

Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

Readings:

(1) Foundations of probability

Basic probability, and some fun examples. Random variables, probability distributions, expected value. Joint, marginal, and conditional probability. Independence. Law of total probability. Bayes' rule.
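
As a small illustration of how the law of total probability and Bayes' rule fit together, here is a minimal sketch in Python (the course scripts themselves are in R). The function name and the screening-test numbers are made up for this example and do not come from the course materials.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(condition | positive test) using Bayes' rule."""
    # Law of total probability:
    # P(+) = P(+ | condition) P(condition) + P(+ | no condition) P(no condition)
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    # Bayes' rule: P(condition | +) = P(+ | condition) P(condition) / P(+)
    return sensitivity * prior / p_positive


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))
```

Even with a fairly accurate test, the posterior probability stays well below one half here because the condition is rare to begin with.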

Readings:

  • excerpts from an in-progress book on probability.

Some optional stuff:

(2) Exploratory analysis

Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation).
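
Two of the measures of association listed above, relative risk and the odds ratio, can be read directly off a 2x2 contingency table. The following is a rough Python sketch (the course scripts are in R); the cell layout and the counts are assumptions made for illustration only.

```python
def relative_risk(a, b, c, d):
    """Risk of the event in the exposed group divided by risk in the unexposed group.

    Assumed table layout:
                   event   no event
        exposed      a        b
        unexposed    c        d
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of the event among the exposed divided by odds among the unexposed."""
    return (a * d) / (b * c)


# Made-up counts: 30 of 100 exposed and 10 of 100 unexposed had the event.
rr = relative_risk(30, 70, 10, 90)
odds = odds_ratio(30, 70, 10, 90)
print(rr, odds)
```

With these counts the relative risk is about 3 while the odds ratio is closer to 3.9; the two measures agree closely only when the event is rare in both groups.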

Scripts and data:

Readings:

(3) Resampling methods

The bootstrap and the permutation test; joint distributions; using the bootstrap to approximate value at risk (VaR).
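
To make the bootstrap idea concrete, here is a hedged Python sketch (the course walkthroughs are in R). It resamples a data set with replacement to approximate the standard error of a sample mean, then reuses the same resampling idea to approximate a 5% value at risk for a single-asset portfolio over a 20-day horizon. The function names, the simulated returns, the starting wealth, and the horizon are all assumptions chosen for illustration, not the course's own setup.

```python
import random
import statistics


def bootstrap_replicates(data, statistic, n_boot=2000, seed=1):
    """Recompute `statistic` on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]


def bootstrap_var(returns, wealth=100_000.0, horizon=20, level=0.05,
                  n_sim=5000, seed=1):
    """Loss at the `level` quantile of simulated profit/loss over `horizon` days."""
    rng = random.Random(seed)
    pnl = []
    for _ in range(n_sim):
        w = wealth
        # One simulated path: resample `horizon` daily returns with replacement.
        for r in rng.choices(returns, k=horizon):
            w *= 1.0 + r
        pnl.append(w - wealth)
    pnl.sort()
    # Report the loss as a positive number.
    return -pnl[int(level * n_sim)]


# Made-up daily returns for illustration only (not course data).
rng = random.Random(0)
returns = [rng.gauss(0.0005, 0.01) for _ in range(250)]

boot_means = bootstrap_replicates(returns, statistics.mean)
se_mean = statistics.stdev(boot_means)  # bootstrap standard error of the mean
var_5 = bootstrap_var(returns)          # 5% value at risk over 20 days
print(se_mean, var_5)
```

Resampling whole days at a time treats daily returns as independent draws from the observed sample, which is a simplifying assumption of this sketch.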

Scripts:

Readings:

  • ISL Section 5.2 for a basic overview.
  • These notes on bootstrapping and the permutation test.
  • Section 2 of these notes, on bootstrap resampling. You can ignore the stuff about utility if you want.
  • This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
  • Another R walkthrough on the permutation test in a simple 2x2 table.
  • Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.

Optionally, Shalizi (Chapter 6) has a much lengthier treatment of the bootstrap, should you wish to consult it.
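
The permutation test for a simple 2x2 table, mentioned in the readings above, can be sketched as follows. This is an illustrative Python version under stated assumptions, not the course's R walkthrough: outcomes are coded 0/1, the test statistic is the absolute difference in proportions, and the counts are invented.

```python
import random


def permutation_test(group_a, group_b, n_perm=5000, seed=1):
    """Two-sided permutation p-value for a difference in proportions (0/1 outcomes)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling the pooled outcomes breaks any link between group and outcome.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Made-up 2x2 table: 30 of 100 successes in group A, 10 of 100 in group B.
group_a = [1] * 30 + [0] * 70
group_b = [1] * 10 + [0] * 90
p_value = permutation_test(group_a, group_b)
print(p_value)
```

A small p-value says that a difference this large rarely arises when group labels are assigned at random.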

(4) Clustering

Basics of clustering; K-means clustering; hierarchical clustering.
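
For readers who want to see the mechanics of K-means, here is a minimal Python sketch of the standard alternating scheme (assign each point to its nearest center, then move each center to the mean of its points). It is written from scratch for illustration; the function names and the two simulated blobs are assumptions, and the course's own R scripts may differ.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=1):
    """Return (centers, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Two made-up, well-separated blobs in the plane.
rng = random.Random(0)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
blob_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(30)]
points = blob_a + blob_b
centers, labels = kmeans(points, k=2)
print(centers)
```

The result depends on the random starting centers, so in practice it is common to rerun from several starts and keep the solution with the smallest within-cluster sum of squares.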

Scripts and data:

Readings:

(5) Latent features and structure

Principal component analysis (PCA). If time: canonical correlation analysis; multi-dimensional scaling.
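
As a rough illustration of what PCA computes, the sketch below finds the first principal component of a small data set: it centers the data, forms the sample covariance matrix, and runs power iteration to find the direction of greatest variance. This is a bare-bones Python sketch written for this page, not the course's R code; the function name and the data are made up, and it assumes the data are not constant and that the all-ones starting vector is not orthogonal to the leading direction.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, variance) of the first principal component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p  # assumed starting vector for power iteration
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data projected onto the direction v.
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
    return v, variance


# Made-up data lying close to the line y = 2x.
rows = [(float(x), 2.0 * x + (0.1 if x % 2 == 0 else -0.1)) for x in range(10)]
direction, variance = first_principal_component(rows)
print(direction, variance)
```

For these points the recovered direction has a slope close to 2, matching the line the data were built around.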

Scripts and data:

Readings:

  • ISL Section 10.2 for the basics
  • Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor analysis, beyond what we covered in class.
  • Elements Chapter 14.5 (more advanced)

(6) Text data

Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
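
To give a feel for TF-IDF weighting, here is a small Python sketch. TF-IDF has several common variants; this one assumes term frequency is a word's share of the document and inverse document frequency is log(N / df), with whitespace tokenization. The function name and the three toy documents are invented for illustration and are not from the course scripts.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, made up for this example.
docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document (here "the") gets weight zero, while a word unique to one document gets the largest weight in that document.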

Scripts and data:

Readings:

(7) Miscellaneous

Coverage of these topics will depend on the time available. Possibilities include: anomaly detection; label propagation; learning association rules; graph partitioning; partial least squares.

Scripts and data:

Readings: