statistical learning methods

reading notes on 统计学习方法 and A Course in Machine Learning

statistical learning methods

  • methods = model + strategy + algorithm
    • model = a conditional probability distribution or a decision function
    • strategy = the criterion for selecting the model, i.e., a loss function (cost function)
    • algorithm: the method used to solve the resulting optimization problem
  • supervised learning
    • classification: identifying to which of a set of categories a new observation belongs, on the basis of a training set of data
    • tagging problem (structured prediction):
      • input: observation sequence; output: tag sequence or state sequence
      • e.g., part-of-speech tagging
    • regression: estimates the relationship between a dependent variable and one or more independent variables
      • linear regression and multiple linear regression
  • unsupervised learning
  • semi-supervised learning
  • reinforcement learning

loss function and cost function

  • a function that maps an event or values of one or more variables onto a real number intuitively representing some "loss" associated with the event
  • most commonly used methods:
    • 0-1 loss function:
      $L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$
    • quadratic loss function:
      $L(Y, f(X)) = (Y - f(X))^2$
    • absolute loss function:
      $L(Y, f(X)) = |Y - f(X)|$
    • logarithmic loss function (log-likelihood loss function):
      $L(Y, P(Y|X)) = -\log P(Y|X)$
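
  A minimal Python sketch of the four losses above, applied to single predictions (numpy is assumed; the function names are mine, not the book's):

    import numpy as np

    def zero_one_loss(y, f_x):
        # 1 if the prediction differs from the label, 0 otherwise
        return 1.0 if y != f_x else 0.0

    def quadratic_loss(y, f_x):
        # squared difference between label and prediction
        return (y - f_x) ** 2

    def absolute_loss(y, f_x):
        # absolute difference between label and prediction
        return abs(y - f_x)

    def log_loss(p_true_label):
        # negative log of the probability the model assigns to the true label
        return -np.log(p_true_label)

    print(zero_one_loss(1, -1), quadratic_loss(3.0, 2.5),
          absolute_loss(3.0, 2.5), log_loss(0.8))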

ERM and SRM

  • empirical risk minimization (ERM)
    • compute an empirical risk by averaging the loss function on the training set:
      $R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$
    • e.g., maximum likelihood estimation (MLE)
    • disadvantage: over-fitting when sample size is small
  • structural risk minimization (SRM)
    • regularization: balancing the model's complexity against its success at fitting the training data (via a penalty term)
      $R_{srm}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$, where $J(f)$ measures model complexity and $\lambda \geq 0$ controls the trade-off
    • e.g., maximum posterior probability estimation (MAP)
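
  A toy Python sketch contrasting the two objectives, assuming squared loss for a linear model and an L2 penalty J(f) = ||w||^2 (these concrete choices and all names are illustrative assumptions, not the book's):

    import numpy as np

    def empirical_risk(w, X, y):
        # ERM objective: average squared loss over the training set
        return np.mean((X @ w - y) ** 2)

    def structural_risk(w, X, y, lam=0.1):
        # SRM objective: empirical risk plus a complexity penalty lambda * J(f)
        return empirical_risk(w, X, y) + lam * np.sum(w ** 2)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
    w = np.array([1.0, -2.0, 0.5])
    print(empirical_risk(w, X, y), structural_risk(w, X, y))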

training error and test error

  • training error & test error
    training error (average loss of the learned model $\hat{f}$ over the N training samples):
    $R_{emp}(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$
    test error (average loss over the N' test samples):
    $e_{test} = \frac{1}{N'} \sum_{i=1}^{N'} L(y_i, \hat{f}(x_i))$
    • note: L --> loss function
  • error rate: L = 0-1 loss function
    $e_{test} = \frac{1}{N'} \sum_{i=1}^{N'} I(y_i \neq \hat{f}(x_i))$
    • note: I --> indicator function: I = 1 if $y_i \neq \hat{f}(x_i)$, otherwise I = 0
  • accuracy:
    $r_{test} = \frac{1}{N'} \sum_{i=1}^{N'} I(y_i = \hat{f}(x_i))$
    error rate + accuracy: $e_{test} + r_{test} = 1$
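
  A small Python sketch computing test error rate and accuracy under the 0-1 loss (the labels below are made up for illustration):

    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 1])   # test labels
    y_pred = np.array([1, 0, 0, 1, 0, 1])   # model predictions

    error_rate = np.mean(y_true != y_pred)  # mean of I(y != f(x)) over the test set
    accuracy = np.mean(y_true == y_pred)    # mean of I(y == f(x))
    print(error_rate, accuracy, error_rate + accuracy)  # the sum is always 1.0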

regularization and cross validation

  • regularization: introduce additional information (a regularization term) into the loss function --> in order to prevent overfitting
    $\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$
    • Occam's razor: a problem-solving principle --> among competing hypotheses, the one with the fewest assumptions should be selected
  • cross validation: a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set
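
  A minimal Python sketch of k-fold cross validation; fit_and_score is a hypothetical stand-in for training any learner and scoring it on the held-out fold:

    import numpy as np

    def k_fold_cv(X, y, k, fit_and_score, seed=0):
        # split indices into k folds; each fold serves once as the validation set
        indices = np.random.default_rng(seed).permutation(len(X))
        folds = np.array_split(indices, k)
        scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(fit_and_score(X[train_idx], y[train_idx],
                                        X[val_idx], y[val_idx]))
        return np.mean(scores)   # average validation score across the k folds

    # toy "learner": always predict the majority label seen in training
    def fit_and_score(X_tr, y_tr, X_val, y_val):
        majority = np.bincount(y_tr).argmax()
        return np.mean(y_val == majority)    # validation accuracy

    X = np.random.default_rng(1).normal(size=(30, 2))
    y = np.random.default_rng(2).integers(0, 2, size=30)
    print(k_fold_cv(X, y, k=5, fit_and_score=fit_and_score))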

generalization error

  • measure of how accurately an algorithm is able to predict outcome values for previously unseen data
    $R_{exp}(\hat{f}) = E_P[L(Y, \hat{f}(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, \hat{f}(x)) P(x, y) \, dx \, dy$
  • the performance of a machine learning algorithm is measured by plots of the generalization error values through the learning process (learning curves) --> the smaller the generalization error, the better the model
    • generalization error bound: the generalization error is less than some error bound --> the smaller the generalization error bound, the better the model

generative model and discriminative model

  • generative model: learns the joint probability distribution p(x,y)
    • e.g., Naive Bayes, Hidden Markov Model
    • converges faster to the true model as the sample size grows
  • discriminative model: learns the conditional probability distribution p(y|x) --> the probability of y given x
    • e.g., K Nearest Neighbor, Perceptron, Decision Tree, Logistic Regression, Maximum Entropy, Support Vector Machine, Boosting Method, Conditional Random Field
    • usually higher prediction accuracy

linearly separable data set (linear separability)

  • there exists at least one hyperplane (a line in two dimensions) in the feature space with all points from one class on one side and all points from the other class on the other side

distance metric

the L_p (Minkowski) distance between two points $x_i, x_j \in \mathbf{R}^n$:
$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^p \right)^{1/p}, \quad p \geq 1$

  • Euclidean distance: p=2
    $L_2(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^2 \right)^{1/2}$
  • Manhattan distance: p=1
    $L_1(x_i, x_j) = \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|$
  • max distance between coordinates: p=∞
    $L_\infty(x_i, x_j) = \max_{l} |x_i^{(l)} - x_j^{(l)}|$
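
  A small Python sketch of the L_p distance covering the three special cases above (the function name is mine):

    import numpy as np

    def lp_distance(x1, x2, p):
        # Minkowski (L_p) distance; p = inf gives the max coordinate difference
        diff = np.abs(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
        if np.isinf(p):
            return float(diff.max())
        return float((diff ** p).sum() ** (1.0 / p))

    a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
    print(lp_distance(a, b, 2))             # Euclidean distance
    print(lp_distance(a, b, 1))             # Manhattan distance
    print(lp_distance(a, b, float("inf")))  # max distance between coordinates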

  1. Perceptron
    supervised learning; binary classifiers; only works for linearly separable data; allows for online learning; error driven

    • feature space:
      $\mathcal{X} \subseteq \mathbf{R}^n$
    • output space:
      $\mathcal{Y} = \{+1, -1\}$
    • feature space --> output space:
      $f(x) = \mathrm{sign}(w \cdot x + b)$
      $\mathrm{sign}(x) = \begin{cases} +1, & x \geq 0 \\ -1, & x < 0 \end{cases}$
    • separating hyperplane: determined by w (weight vector) and b (bias)
      $w \cdot x + b = 0$
    • loss function (empirical risk function):
      $L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$
      M: the set of misclassified samples
    • optimization goal:
      $\min_{w, b} L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b)$
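
    A minimal Python sketch of the primal perceptron learning algorithm above, using stochastic gradient descent on the perceptron loss (the learning rate and toy data are illustrative; the loop assumes linearly separable data so that it terminates):

      import numpy as np

      def train_perceptron(X, y, eta=1.0, max_epochs=100):
          # for each misclassified point (y_i * (w.x_i + b) <= 0),
          # update w <- w + eta * y_i * x_i and b <- b + eta * y_i
          w = np.zeros(X.shape[1])
          b = 0.0
          for _ in range(max_epochs):
              errors = 0
              for x_i, y_i in zip(X, y):
                  if y_i * (np.dot(w, x_i) + b) <= 0:
                      w += eta * y_i * x_i
                      b += eta * y_i
                      errors += 1
              if errors == 0:        # no misclassified samples left
                  break
          return w, b

      # toy linearly separable data with labels in {+1, -1}
      X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
      y = np.array([1, 1, -1])
      w, b = train_perceptron(X, y)
      print(w, b, np.sign(X @ w + b))
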
  2. K-Nearest Neighbor
    a non-parametric method, lazy learning, non-linear classifier

    • feature space:
      $\mathcal{X} \subseteq \mathbf{R}^n$
    • output space:
      $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$
    • feature space --> output space: majority voting rule
      $y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$, where $N_k(x)$ is the set of the k nearest neighbors of x
      note: I --> indicator function: I = 1 if $y_i = c_j$, otherwise I = 0
    • K:
      • small k (decreases approximation error; increases estimation error; prediction is sensitive to noise among the nearest neighbors) --> complicated model --> prone to over-fitting
      • large k (decreases estimation error; increases approximation error) --> simple model --> prone to under-fitting
    • k-dimensional tree (kd tree): a space-partitioning data structure for organizing points in a k-dimensional space
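
    A brute-force Python sketch of the majority-voting rule above; a kd tree would only speed up the neighbor search, so plain distance sorting is used here for clarity:

      import numpy as np
      from collections import Counter

      def knn_predict(X_train, y_train, x, k=3, p=2):
          # L_p distances from the query point x to every training point
          dists = np.sum(np.abs(X_train - x) ** p, axis=1) ** (1.0 / p)
          nearest = np.argsort(dists)[:k]           # indices of the k nearest neighbors
          votes = Counter(y_train[i] for i in nearest)
          return votes.most_common(1)[0][0]         # majority vote among the neighbors

      X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
      y_train = np.array([0, 0, 1, 1])
      print(knn_predict(X_train, y_train, np.array([0.9, 1.1]), k=3))
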
  3. Naive Bayes
    a probabilistic classifier with strong (naive) independence assumptions between the features

    • feature space:
      $\mathcal{X} \subseteq \mathbf{R}^n$
    • output space:
      $\mathcal{Y} = \{c_1, c_2, \ldots, c_K\}$
    • feature space --> output space:
      • prior probability:
        $P(Y = c_k), \quad k = 1, 2, \ldots, K$
      • conditional probability: independence assumptions between the features
        $P(X = x \mid Y = c_k) = \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \mid Y = c_k)$
      • posterior probability (Bayes' theorem):
        $P(Y = c_k \mid X = x) = \frac{P(X = x \mid Y = c_k) P(Y = c_k)}{\sum_{k} P(X = x \mid Y = c_k) P(Y = c_k)}$
      • classification:
        $y = \arg\max_{c_k} \frac{P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}{\sum_{k} P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)}$
        simplify --> remove the denominator (it is the same for every $c_k$):
        $y = \arg\max_{c_k} P(Y = c_k) \prod_{j} P(X^{(j)} = x^{(j)} \mid Y = c_k)$
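
    A Python sketch of naive Bayes for categorical features, estimating the prior and conditional probabilities by maximum-likelihood counts (no Laplace smoothing; the toy data and names are mine):

      from collections import Counter, defaultdict

      def train_naive_bayes(X, y):
          n = len(y)
          prior = {c: cnt / n for c, cnt in Counter(y).items()}   # P(Y = c_k)
          cond = defaultdict(dict)                                # P(X^(j) = v | Y = c_k)
          for c in prior:
              rows = [X[i] for i in range(n) if y[i] == c]
              for j in range(len(X[0])):
                  counts = Counter(row[j] for row in rows)
                  for v, cnt in counts.items():
                      cond[(c, j)][v] = cnt / len(rows)
          return prior, cond

      def predict(prior, cond, x):
          # y = argmax_{c_k} P(Y = c_k) * prod_j P(X^(j) = x^(j) | Y = c_k)
          scores = {}
          for c, p in prior.items():
              score = p
              for j, v in enumerate(x):
                  score *= cond[(c, j)].get(v, 0.0)
              scores[c] = score
          return max(scores, key=scores.get)

      # toy data: two categorical features, labels in {+1, -1}
      X = [["S", "a"], ["M", "a"], ["M", "b"], ["L", "b"], ["L", "b"]]
      y = [-1, -1, 1, 1, 1]
      prior, cond = train_naive_bayes(X, y)
      print(predict(prior, cond, ["M", "b"]))   # predicts 1
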
  4. Decision Tree
    interpretability, fast algorithm

  • decision tree: uses a decision tree as a predictive model: maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves)
    • node:
      • internal node: test on an attribute
      • leaf node: class
    • directed edge: the outcome of the test
    • paths from root to leaf: classification rules
  • entropy:
    • uncertainty of a random variable: the larger the entropy, the greater the uncertainty
    • used to decide which feature to split on at each step in building the tree
    • definition: for a discrete random variable X with distribution $P(X = x_i) = p_i, \; i = 1, 2, \ldots, n$:
      $H(X) = -\sum_{i=1}^{n} p_i \log p_i$
      entropy does not depend on the values of X, only on its distribution, so it can also be written as
      $H(p) = -\sum_{i=1}^{n} p_i \log p_i$
    • conditional entropy:
      $H(Y \mid X) = \sum_{i=1}^{n} p_i H(Y \mid X = x_i), \quad p_i = P(X = x_i)$
  • information gain (mutual information):
    $g(D, A) = H(D) - H(D \mid A)$
    • H(D): the uncertainty in classifying dataset D
    • H(D|A): the uncertainty in classifying dataset D when feature A is given
    • g(D, A): the reduction in the uncertainty of classifying dataset D brought by feature A
  • information gain ratio:
    $g_R(D, A) = \frac{g(D, A)}{H_A(D)}$, where $H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$ is the entropy of D with respect to the values of feature A
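
    A Python sketch of empirical entropy and information gain, matching the formulas above (base-2 logarithms; the function names and toy data are mine):

      import math
      from collections import Counter

      def entropy(labels):
          # H(D) = -sum_k p_k * log2(p_k), with p_k estimated from label frequencies
          n = len(labels)
          return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

      def information_gain(feature, labels):
          # g(D, A) = H(D) - H(D|A), where H(D|A) = sum_i (|D_i| / |D|) * H(D_i)
          n = len(labels)
          cond = 0.0
          for v in set(feature):
              subset = [labels[i] for i in range(n) if feature[i] == v]
              cond += (len(subset) / n) * entropy(subset)
          return entropy(labels) - cond

      # toy example: one feature A (values "x"/"y") against a binary class label
      A = ["x", "x", "y", "y", "y", "x"]
      labels = [0, 0, 1, 1, 0, 0]
      print(entropy(labels), information_gain(A, labels))
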
  1. Logistic Regression

  2. Support Vector Machine

  3. AdaBoost

  4. EM

  5. Hidden Markov Model

  6. Random Field

  7. Latent Dirichlet allocation

  8. Monte Carlo