Welcome to part 2 of STA 380, a course on machine learning in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.
Instructors:
- Dr. James Scott. Office hours on M T W, 12:30 to 1:15 PM, CBA 6.478.
- Dr. David Puelz. Office hours on M T W, 4:00 to 4:45 PM, CBA 6.444.
The exercises are available here. They are due Monday, August 15th at 5 PM, U.S. Central Time. Pace yourself over the next few weeks, and start early on the first couple of problems!
Slides: The data scientist's toolbox
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and Github.
Readings:
- Introduction to RMarkdown
- RMarkdown tutorial
- Introduction to GitHub
- Getting started with GitHub Desktop
- Jeff Leek's guide to sharing data
Your assignment after the first class day is to get yourself up and running on GitHub, if you're not already.
Slides: Some fun topics in probability
Optional reference: Chapter 1 of these course notes. There's a lot more technical stuff in here, but Chapter 1 really covers the basics of what every data scientist should know about probability.
Topics: plotting pitfalls; the grammar of graphics; data visualization with R.
Slides:
R materials:
- Lessons 4-6 of Data Science in R: A Gentle Introduction. You'll find lesson 5 a bit basic, so feel free to breeze through it. The main things to take away from lesson 5 are the use of pipes (%>%) and the summarize function.
- Some R examples can be found in datavis_intro.R and nycflights_wrangle.R.
Intro to neural networks: slides here, Jupyter notebooks here.
Basics of clustering; K-means clustering; hierarchical clustering; spectral clustering.
Slides: Introduction to clustering.
Scripts and data:
Readings:
- ISL Sections 10.1 and 10.3, or Elements Chapter 14.3 (more advanced)
- K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
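To make the k-means++ idea concrete, here is a minimal pure-Python sketch (not one of the course's R scripts; the function names and toy data are invented for illustration). The key step is that each new starting center is drawn with probability proportional to its squared distance from the nearest center already chosen, after which the usual Lloyd assignment/update iterations run as normal.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """K-means++: pick each new center with probability proportional
    to its squared distance from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(dist2(p, c) for c in centers) for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: dist2(p, centers[j]))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

# Two well-separated blobs in 2-D: k-means should recover one center per blob
pts = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, clusters = kmeans(pts, k=2)
```

Because the first k-means++ center lands in one blob, the second is almost certain to be drawn from the other (its points are roughly 50 units of squared distance away, versus ~0.05 within a blob), which is exactly why this initialization avoids the bad local optima that random starts can produce.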
Principal component analysis (PCA). T-distributed stochastic neighbor embedding (tSNE).
Slides: Introduction to PCA and tSNE
Scripts and data for class:
- pca_intro.R
- nbc.R, nbc_showdetails.csv, nbc_pilotsurvey.csv
- congress109.R, congress109.csv, and congress109members.csv
- ercot_PCA.R, ercot.zip
- tSNE.ipynb
Readings:
- ISL Section 10.2 for the basics or Elements Chapter 14.5 (more advanced)
- Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has much more material on factor models, beyond what we covered in class.
Networks and association rule mining.
Slides: Intro to networks. Note: these slides refer to "lastfm.R" but this is the same thing as "playlists.R" below.
Software you'll need:
- Gephi, a great piece of software for exploring graphs
- The Gephi quick-start tutorial
Scripts and data:
- medici.R and medici.txt
- playlists.R and playlists.csv
- microfi.R, microfi_households.csv, and microfi_edges.txt.
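For the association-rule side of this unit, the three standard rule measures (support, confidence, lift) reduce to simple counting. Here is a minimal pure-Python sketch (not one of the course scripts; the function name and grocery baskets are invented for illustration):

```python
def rule_stats(baskets, lhs, rhs):
    """Support, confidence, and lift for the rule lhs -> rhs,
    computed by simple counting over the baskets."""
    n = len(baskets)
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for b in baskets if lhs <= set(b))
    n_rhs = sum(1 for b in baskets if rhs <= set(b))
    n_both = sum(1 for b in baskets if (lhs | rhs) <= set(b))
    support = n_both / n          # P(lhs and rhs)
    confidence = n_both / n_lhs   # P(rhs | lhs)
    lift = confidence / (n_rhs / n)  # how much lhs raises the odds of rhs
    return support, confidence, lift

baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "diapers"},
    {"chips", "salsa"},
    {"milk", "bread"},
]
s, c, l = rule_stats(baskets, {"beer"}, {"chips"})
```

A lift above 1 means the left-hand side makes the right-hand side more likely than its baseline rate; real rule miners (e.g. the `arules` package in R) just search efficiently over many candidate rules for high values of these same statistics.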
Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).
Slides:
Scripts and data:
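Of the topics above, TF-IDF is the easiest to show in a few lines. This stdlib-only Python sketch (not a course script; the tokenized toy documents are invented) uses the basic weighting tf = count / document length and idf = log(N / document frequency), so words that appear in every document get weight near zero while rare, document-specific words are up-weighted:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenized documents.
    tf = raw count / doc length; idf = log(N / doc frequency)."""
    N = len(docs)
    df = Counter()  # number of documents each word appears in
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        L = len(doc)
        weights.append({w: (c / L) * math.log(N / df[w])
                        for w, c in counts.items()})
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
W = tfidf(docs)
```

Note that "cat" (which appears in only one document) ends up with a higher weight in document 0 than "the," even though "the" occurs twice there; that reweighting is the whole point of the scheme. Production variants add smoothing to the idf term, but the structure is the same.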
Treatment effects; multi-armed bandits and Thompson sampling; high-dimensional treatment effects with the lasso.
Slides:
Scripts and data:
- mab.R and Ads_CTR_Optimisation.csv
- hockey.R and all files in data/hockey/
- smallbeer.R and smallbeer.csv
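To preview the Thompson sampling idea before working through mab.R, here is a self-contained Python sketch for Bernoulli bandits (the function name, seed, and click-through rates are invented for illustration; the course materials use R). Each arm keeps a Beta posterior over its success rate; every round we draw one sample per arm from those posteriors and pull whichever arm drew the highest value, so exploration happens automatically while the posteriors are still uncertain.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Bernoulli bandit with Beta(1,1) priors: each round, sample a
    success rate from every arm's posterior and pull the argmax arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins = [0] * k    # observed successes per arm
    losses = [0] * k  # observed failures per arm
    pulls = [0] * k
    for _ in range(n_rounds):
        # One posterior draw per arm: Beta(wins + 1, losses + 1)
        draws = [rng.betavariate(wins[j] + 1, losses[j] + 1)
                 for j in range(k)]
        j = max(range(k), key=lambda a: draws[a])
        # Simulate the (unknown-to-the-algorithm) true reward
        reward = 1 if rng.random() < true_rates[j] else 0
        wins[j] += reward
        losses[j] += 1 - reward
        pulls[j] += 1
    return pulls

# Arm 2 has the highest true click-through rate,
# so it should attract the bulk of the pulls over time
pulls = thompson_sampling([0.05, 0.10, 0.25], n_rounds=2000)
```

As the posteriors for the inferior arms concentrate below the best arm's, those arms are sampled less and less; this is the same mechanism mab.R applies to the ad click-through data in Ads_CTR_Optimisation.csv.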