mlcourse

Machine learning course materials.


Notable Changes from 2017 to 2018

  • New module on backpropagation.
  • Added a note on conditional expectations, since many students find the notation confusing.
  • Added a note on the correlated features theorem for elastic net. It is essentially a translation of Theorem 1 from Zou and Hastie's 2005 paper "Regularization and variable selection via the elastic net" into the notation of our class, dropping an unnecessary centering condition and using a more standard definition of correlation. (The theorem is sketched after this list.)
  • Changes to the EM algorithm presentation: added several diagrams (slides 10-14) to convey the general idea of a variational method, and made explicit that the marginal log-likelihood is exactly the pointwise supremum over the variational lower bounds (slides 31 and 32; the identity is sketched after this list).
  • New worked example for predicting Poisson distributions with linear models and gradient boosting (a minimal version is sketched below).
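
For reference, the result behind the correlated features note, roughly as stated in Theorem 1 of Zou and Hastie (2005): with y centered, features standardized, rho = x_i^T x_j the sample correlation, and assuming the two fitted coefficients share a sign,

```latex
\frac{1}{\lVert y \rVert_1}\,\bigl|\hat{\beta}_i(\lambda_1,\lambda_2) - \hat{\beta}_j(\lambda_1,\lambda_2)\bigr|
  \;\le\; \frac{1}{\lambda_2}\sqrt{2(1-\rho)}.
```

As rho approaches 1, strongly correlated features are forced toward equal coefficients: the "grouping effect" that the note translates into our notation.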
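
The EM identity from slides 31 and 32, in standard notation (observed x, latent z, variational distribution q over z): the variational lower bound decomposes as

```latex
\mathcal{L}(q,\theta)
  = \mathbb{E}_{z \sim q}\!\left[\log \frac{p(x,z;\theta)}{q(z)}\right]
  = \log p(x;\theta) - \mathrm{KL}\bigl(q(z)\,\Vert\,p(z \mid x;\theta)\bigr),
```

and since the KL term is nonnegative and vanishes exactly when q is the posterior, log p(x; theta) = sup_q L(q, theta); that is, the marginal log-likelihood is the pointwise supremum over the variational lower bounds.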
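
A minimal sketch of the Poisson worked example, on synthetic data rather than the course's actual notebook (scikit-learn >= 1.0 assumed for the estimator names):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
# Counts are Poisson with a log-linear rate, so the GLM is well specified here.
rate = np.exp(0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.1)
y = rng.poisson(rate)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear (Poisson GLM)": PoissonRegressor(alpha=1e-4),
    "gradient boosting": HistGradientBoostingRegressor(loss="poisson"),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    dev = mean_poisson_deviance(y_te, model.predict(X_te))
    print(f"{name}: mean Poisson deviance = {dev:.3f}")
```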

Notable Changes from 2016 to 2017

  • New lecture on principal component analysis (Brett)
  • Added slide on k-means++ (Brett); the seeding step is sketched after this list.
  • Added slides on the explicit feature vector for the 1-dim RBF kernel (see the sketch after this list).
  • Created notebook to regenerate the buggy lasso/elastic net plots from Hastie's book (Vlad)
  • Noted that an L2 constraint on the weights of a linear model gives Lipschitz continuity of the prediction function (thanks to Brian Dalessandro for pointing this out to me); the one-line derivation appears after this list.
  • Expanded discussion of L1/L2/elastic net with correlated random variables (thanks to Brett for the figures).
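
A sketch of the k-means++ seeding step from that slide (kmeanspp_init is a hypothetical helper, not course code):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: after a uniformly random first center, choose each
    new center with probability proportional to its squared distance from
    the nearest center picked so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(kmeanspp_init(X, 3, rng))  # three well-spread initial centers
```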
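
For the 1-dim RBF kernel, taking the bandwidth so that k(x, y) = exp(-(x - y)^2 / 2), one explicit feature map is phi_n(x) = exp(-x^2 / 2) x^n / sqrt(n!) for n = 0, 1, 2, ..., since exp(-x^2/2) exp(-y^2/2) sum_n (xy)^n / n! = exp(-(x - y)^2 / 2). A truncated numerical check, as a sketch:

```python
import numpy as np
from math import factorial

def phi(x, N=10):
    """First N coordinates of the explicit feature map phi_n(x)."""
    n = np.arange(N)
    return np.exp(-x**2 / 2) * x**n / np.sqrt([factorial(i) for i in n])

x, y = 0.7, -0.4
print(np.exp(-(x - y) ** 2 / 2))  # exact kernel value
print(phi(x) @ phi(y))            # truncation at N=10 is already very close
```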
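
The Lipschitz claim is one line of Cauchy-Schwarz: for a linear prediction function f(x) = w^T x with ||w||_2 <= r,

```latex
|f(x) - f(x')| = \lvert w^\top (x - x') \rvert
  \le \lVert w \rVert_2 \, \lVert x - x' \rVert_2
  \le r \, \lVert x - x' \rVert_2,
```

so f is Lipschitz with constant at most the constraint radius r.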

Notable Changes from 2015 to 2016

Possible Future Topics

  • Quantile regression (as part of the loss functions homework or notes); conditional prediction intervals more generally
  • Gaussian processes
  • Active learning
  • Collaborative filtering / matrix factorization
  • Learning to rank and associated concepts
  • Simulation methods and more variational methods for probabilistic modeling
  • Reinforcement learning (minimal path to REINFORCE)
  • More depth on basic neural network stuff: weight initialization, vanishing / exploding gradient, possibly batch normalization
  • Density ratio estimation -- Ginsu knife of ML? (for covariate shift, anomaly detection, conditional probability modeling)
  • Bandits
    • Importance weights / learning from logged data?
  • Something about causality?
  • Generalized additive models for interpretable nonlinear fits (smoothing way, basis function way, and gradient boosting way)
  • Finish up 'structured prediction' with beam search / Viterbi
    • give probabilistic analogue with MEMMs/CRFs
  • Black-box feature importance measures
  • Naive Bayes vs. logistic regression (Ng & Jordan, plus new experiments including regularization)

Citation Information

Machine Learning Course Materials by Various Authors is licensed under a Creative Commons Attribution 4.0 International License. The author of each document in this repository is considered the license holder for that document.