Reading notes on 统计学习方法 (Statistical Learning Methods) and A Course in Machine Learning
- method = model + strategy + algorithm
- model: a conditional probability distribution or a decision function
- strategy: loss function (cost function), i.e., the criterion for selecting the optimal model
- algorithm: how to solve the resulting optimization problem
- supervised learning
- classification: identifying to which of a set of categories a new observation belongs, on the basis of a training set of data
- tagging problem (structure prediction):
- input: observation sequence; output: tag sequence or status sequence
- e.g., part of speech tagging
- regression: estimate the relationship between a dependent variable and one or more independent variables
- linear regression and multiple linear regression
- unsupervised learning
- semi-supervised learning
- reinforcement learning
- loss function (cost function): a function that maps an event or the values of one or more variables onto a real number intuitively representing some "loss" associated with the event
- expected loss (risk): $R_{exp}(f) = E_P\big[L(Y, f(X))\big] = \int L(y, f(x))\, P(x, y)\, \mathrm{d}x\, \mathrm{d}y$
- smaller the loss, better the model --> goal: select the model with the smallest expected loss
- most commonly used methods:
- empirical risk minimization (ERM)
- structural risk minimization (SRM)
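- the two objectives in standard form (N training samples, loss L, hypothesis space $\mathcal{F}$, complexity penalty $J(f)$, coefficient $\lambda \ge 0$):
- ERM: $\min_{f \in \mathcal{F}} \; R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$
- SRM: $\min_{f \in \mathcal{F}} \; R_{srm}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$, where $\lambda J(f)$ penalizes model complexity (this is the regularization term discussed below)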
- training error & test error
- note: L --> loss function
- error rate (when L is the 0-1 loss function): $e_{test} = \frac{1}{N'} \sum_{i=1}^{N'} I\big(y_i \neq \hat{f}(x_i)\big)$
- note: I --> indicator function: if $y_i = \hat{f}(x_i)$ then I = 0, o.w. I = 1
- accuracy: $r_{test} = \frac{1}{N'} \sum_{i=1}^{N'} I\big(y_i = \hat{f}(x_i)\big) = 1 - e_{test}$
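- a tiny sketch of these two quantities in code (array names are illustrative):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Average 0-1 loss: fraction of misclassified points."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def accuracy(y_true, y_pred):
    """Fraction correctly classified; equals 1 - error_rate."""
    return 1.0 - error_rate(y_true, y_pred)

print(error_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))    # 0.75
```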
- regularization: introduce additional information (a regularization term) into the loss function --> in order to prevent overfitting
- Occam's razor: a problem-solving principle --> among competing hypotheses, the one with the fewest assumptions should be selected
- cross validation: a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set
- measure of how accurately an algorithm is able to predict outcome values for previously unseen data
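- a minimal sketch of k-fold cross validation (the `model_factory` interface with `fit`/`predict` is an assumption for illustration, not an API from either book):

```python
import numpy as np

def k_fold_cv(model_factory, X, y, k=5, seed=0):
    """Average held-out error rate over k folds.
    model_factory: callable returning a fresh model with fit(X, y) and predict(X)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))
    return float(np.mean(errors))
```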
- the performance of a learning algorithm is often assessed by plotting the generalization error over the learning process (learning curves) --> smaller the generalization error, better the model
- generalization error bound: the generalization error is less than some error bound --> smaller the generalization error bound, better the model
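- for a finite hypothesis space with d functions and the 0-1 loss, the bound in 统计学习方法 takes the following form (restated here from memory, treat as a sketch): with probability at least $1 - \delta$,
- $R(f) \le \hat{R}(f) + \varepsilon(d, N, \delta)$, with $\varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\big(\log d + \log\frac{1}{\delta}\big)}$, where $R(f)$ is the generalization risk and $\hat{R}(f)$ the training (empirical) risk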
- generative model: learns the joint probability distribution p(x,y), then derives p(y|x) from it
- e.g., Naive Bayes, Hidden Markov Model
- converges faster as the amount of training data grows
- discriminative model: learns the conditional probability distribution p(y|x) directly --> the probability of y given x
- e.g., K Nearest Neighbor, Perceptron, Decision Tree, Logistic Regression, Maximum Entropy, Support Vector Machine, Boosting Method, Conditional Random Field
- often higher accuracy, since it models the prediction target p(y|x) directly
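- the link between the two (a one-line reminder, standard Bayes' rule):
- $P(Y \mid X) = \dfrac{P(X, Y)}{P(X)} = \dfrac{P(X \mid Y)\,P(Y)}{\sum_{y'} P(X \mid Y = y')\,P(Y = y')}$, so a generative model of $P(X, Y)$ can always recover the conditional $P(Y \mid X)$ used for prediction, while a discriminative model estimates $P(Y \mid X)$ (or a decision function) directly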
- linearly separable data set: there exists at least one hyperplane (a line in 2-D) in the feature space with all points from one class on one side and all points from the other class on the other side

Perceptron
- supervised learning; binary classifier; only works for linearly separable data; allows for online learning; error driven
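- a minimal sketch of the primal perceptron update rule (learning rate, variable names, and the toy points are illustrative):

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Primal-form perceptron for labels y in {-1, +1}.
    Error driven: only misclassified points trigger an update.
    Converges only if the data are linearly separable."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified
                w += lr * yi * xi               # w <- w + eta * y * x
                b += lr * yi                    # b <- b + eta * y
                errors += 1
        if errors == 0:                         # perfect separation reached
            break
    return w, b

# toy usage on a separable set
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(perceptron_train(X, y))
```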
K-Nearest Neighbor
- a non-parametric method; lazy learning; non-linear classifier
- input: a feature vector (a point in the feature space); output: a class label
- feature space --> output space: majority voting rule, $y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$
- note: I --> indicator function: if $y_i = c_j$ then I = 1, o.w. I = 0
- K: the number of nearest neighbors used in the vote
- small k (decrease approximation error; increase estimation error; sensitive to noise in the nearest neighbors) --> complicated model --> prone to over-fitting
- large k (decrease estimation error; increase approximation error) --> simple model --> prone to under-fitting
- k-dimensional tree (kd tree): a space-partitioning data structure for organizing points in a k-dimensional space
- recursively partitions the feature space with axis-aligned splits to speed up nearest-neighbor search
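- a brute-force sketch of the classifier (no kd tree; Euclidean distance and the names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # L2 distances to every training point
    neighbors = np.argsort(dists)[:k]             # indices of the k nearest points
    votes = Counter(y_train[neighbors].tolist())  # class counts among the neighbors
    return votes.most_common(1)[0][0]             # majority voting rule

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1
```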
Naive Bayes
- probabilistic classifier; assumes strong (naive) conditional independence between the features given the class
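- the resulting classification rule (standard form; $x^{(j)}$ is the j-th feature of the input):
- $y = \arg\max_{c_k} \; P(Y = c_k) \prod_{j} P\big(X^{(j)} = x^{(j)} \mid Y = c_k\big)$, where the class prior and the per-feature conditionals are estimated from the training data (e.g., by maximum likelihood, possibly with Laplace smoothing)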
Decision Tree
- interpretable; fast to learn and to apply
- decision tree: uses a decision tree as a predictive model: maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves)
- node:
- internal node: test on an attribute
- leaf node: class
- directed edge: the outcome of the test
- paths from root to leaf: classification rules
- entropy: measures the uncertainty of a random variable: larger the entropy, greater the uncertainty
- used (through information gain, below) to decide which feature to split on at each step in building the tree
- the entropy of the random variable X: $H(X) = -\sum_{i=1}^{n} p_i \log p_i$, where $p_i = P(X = x_i)$
- entropy is independent of the values of X; it only depends on the distribution of X
- conditional entropy: $H(Y \mid X) = \sum_{i=1}^{n} p_i\, H(Y \mid X = x_i)$, where $p_i = P(X = x_i)$
- uncertainty of random variable Y given random variable X: larger the conditional entropy, greater the uncertainty
- information gain (mutual information): $g(D, A) = H(D) - H(D \mid A)$
- H(D): the uncertainty in classifying data set D
- H(D|A): the uncertainty in classifying data set D given feature A
- g(D, A): how much feature A reduces the uncertainty in classifying data set D
- information gain ratio: $g_R(D, A) = \dfrac{g(D, A)}{H_A(D)}$, where $H_A(D)$ is the entropy of D with respect to the values of feature A
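- a small sketch of these quantities for one candidate split feature (the arrays and the choice of log base 2 are illustrative):

```python
import numpy as np

def entropy(labels):
    """H(D): empirical entropy of the class labels (log base 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(feature, labels):
    """H(D|A): weighted entropy of the labels after splitting on feature A."""
    h = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        h += len(subset) / len(labels) * entropy(subset)
    return h

def information_gain(feature, labels):
    """g(D, A) = H(D) - H(D|A): reduction in classification uncertainty due to A."""
    return entropy(labels) - conditional_entropy(feature, labels)

# toy usage: one binary feature, binary labels
feature = np.array([0, 0, 1, 1, 1])
labels = np.array([0, 0, 1, 1, 0])
print(information_gain(feature, labels))  # about 0.42 bits
```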
Latent Dirichlet allocation
Monte Carlo