Welcome to part 2 of STA 380, a course on predictive modeling in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.
The reading lists below are still incomplete, but the list of topics is accurate.
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.
Readings:
- a few introductory slides
- Jeff Leek's guide to sharing data
- Introduction to RMarkdown
- Introduction to GitHub
Contingency tables; basic plots (scatterplot, boxplot, histogram); lattice plots; basic measures of association (relative risk, odds ratio, correlation, rank correlation).
Scripts and data:
Readings:
- NIST Handbook, Chapter 1.
- R walkthroughs on basic EDA: contingency tables, histograms, and scatterplots/lattice plots.
- Bad graphics
- Good graphics: scan through some of the New York Times' best data visualizations
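As a small illustration of two of the measures of association listed above, here is a minimal sketch that computes relative risk and the odds ratio from a 2x2 contingency table. It is written in Python for illustration only and is not one of the course's R scripts; the function names and the counts are hypothetical.

```python
def relative_risk(a, b, c, d):
    """Risk of the outcome among the exposed divided by risk among the unexposed.

    Assumed table layout (counts):
                    outcome   no outcome
      exposed          a          b
      unexposed        c          d
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of the outcome among the exposed divided by odds among the unexposed."""
    return (a * d) / (b * c)


# Hypothetical counts, made up purely for this example.
a, b, c, d = 30, 70, 10, 90

rr = relative_risk(a, b, c, d)
or_ = odds_ratio(a, b, c, d)
print(f"relative risk = {rr:.3f}, odds ratio = {or_:.3f}")
```

The two measures agree closely only when the outcome is rare in both rows; with these made-up counts the outcome is fairly common among the exposed, so the odds ratio is noticeably larger than the relative risk.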
The bootstrap and the permutation test; using the bootstrap to approximate value at risk (VaR).
Scripts:
Readings:
- ISL Section 5.2 for a basic overview.
- These notes, pages 99-111. This is an introduction to the bootstrap from the (by now familiar) perspective of linear regression modeling, but it conveys the essential idea.
- This R walkthrough on using the bootstrap to estimate the variability of a sample mean.
- Another R walkthrough on the permutation test in a simple 2x2 table.
- Any basic explanation of the concept of value at risk (VaR) for a financial portfolio, e.g. here, here, or here.
Optionally, Shalizi (Chapter 6) has a much lengthier treatment of the bootstrap, should you wish to consult it.
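To make the bootstrap idea concrete, here is a minimal sketch in Python (for illustration only, not one of the course's R walkthroughs). It resamples a set of daily returns with replacement, first to estimate the variability of the sample mean and then to approximate value at risk over a multi-day horizon. The function names, the 20-day horizon, the 5% level, the starting wealth, and the simulated returns are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_means(sample, n_boot=2000, seed=1):
    """Means of n_boot resamples drawn with replacement, each the size of the original sample."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_boot)]


def bootstrap_var(daily_returns, horizon=20, level=0.05, wealth=10_000.0,
                  n_boot=2000, seed=1):
    """Approximate value at risk by resampling daily returns over the horizon.

    Each simulated path compounds `horizon` returns drawn with replacement.
    VaR is reported as a positive loss at the given tail level.
    """
    rng = random.Random(seed)
    profit_and_loss = []
    for _ in range(n_boot):
        value = wealth
        for r in rng.choices(daily_returns, k=horizon):
            value *= 1.0 + r
        profit_and_loss.append(value - wealth)
    profit_and_loss.sort()
    # Empirical lower-tail quantile of simulated profit and loss.
    idx = int(level * n_boot)
    return -profit_and_loss[idx]


# Hypothetical daily returns, simulated only so the example runs on its own.
rng = random.Random(42)
returns = [rng.gauss(0.0005, 0.01) for _ in range(250)]

boot_means = bootstrap_means(returns)
se_mean = statistics.stdev(boot_means)  # bootstrap standard error of the mean
var_5 = bootstrap_var(returns)

print(f"bootstrap SE of mean daily return: {se_mean:.5f}")
print(f"20-day VaR at the 5% level on $10,000: {var_5:.2f}")
```

Resampling individual days independently ignores any dependence between days (for example, volatility clustering), so this is a simplification rather than a complete risk model.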
K-means clustering; mixture models; hierarchical clustering.
Readings:
- ISL Sections 10.1 and 10.3
- Elements Chapter 14.3 (more advanced)
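For a rough sense of how K-means works, here is a minimal sketch of the standard alternating procedure (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its assigned points, and repeat until the assignments stop changing. It is written in Python for illustration only; the function names, the random initialization, and the simulated data are assumptions for this example, not course code.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means with centers initialized at k randomly chosen data points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments unchanged, so the centers would not move either
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical data: two simulated blobs, made up for this example.
rng = random.Random(7)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]

centers, labels = kmeans(points, k=2)
print("centers:", [tuple(round(c, 2) for c in ctr) for ctr in centers])
```

K-means only finds a local optimum that depends on the starting centers, so in practice it is usually run from several random starts and the solution with the smallest within-cluster sum of squares is kept.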
Principal component analysis (PCA); factor analysis; canonical correlation analysis; multi-dimensional scaling.
Readings:
- ISL Section 10.2 for the basics
- Shalizi Chapters 18 and 19 (more advanced)
- Elements Chapter 14.5 (more advanced)
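As a hedged illustration of the basic idea behind PCA, the sketch below finds the first principal component as the leading eigenvector of the sample covariance matrix, using power iteration. This is one simple way to compute it, chosen here because it needs only the Python standard library; it is not necessarily how the readings or any particular R function proceed. The function name and the toy data are assumptions for this example.

```python
import math
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Leading eigenvector of the sample covariance matrix, via power iteration.

    Returns the unit-length direction and the variance of the data along it.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]

    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(p)]  # random starting direction
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then rescale to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    cov_v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    variance = sum(v[i] * cov_v[i] for i in range(p))
    return v, variance


# Toy data lying exactly on the line y = 2x, made up for this example,
# so the first component should point along (1, 2) up to sign.
rows = [(float(x), 2.0 * x) for x in range(-5, 6)]

direction, variance = first_principal_component(rows)
print("direction:", [round(x, 4) for x in direction])
print("variance along it:", round(variance, 4))
```

The sign of a principal component is arbitrary, so the direction may come out as either (1, 2) or (-1, -2) after scaling to unit length.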
Co-occurrence statistics; TF-IDF; topic models; vector-space models of text (if time allows).
Readings: TBA
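Until readings are posted, here is a minimal sketch of TF-IDF weighting, in Python for illustration only. It uses one common variant: term frequency is the share of a document's tokens that match the term, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. Other variants (smoothed or differently scaled) exist. The function name, the whitespace tokenizer, and the example documents are assumptions for this sketch.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per document.

    weight = (count of term in doc / tokens in doc) * log(n_docs / docs containing term)
    """
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    top = max(w, key=w.get)
    print(f"{doc!r}: highest-weighted term is {top!r}")
```

A term that appears in every document, such as "the" here, gets an inverse document frequency of log(1) = 0, which is how the weighting downplays words that do not distinguish one document from another.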
Coverage of these topics will depend on the time available. Possibilities include: anomaly detection; label propagation; learning association rules; graph partitioning; partial least squares.
Readings: TBA