Reading notes on 统计学习方法 (Statistical Learning Methods) and A Course in Machine Learning
- method = model + strategy + algorithm
- model: a conditional probability distribution or a decision function
- strategy: loss function (cost function), i.e., the criterion for selecting the optimal model
- algorithm: how to solve the resulting optimization problem
- supervised learning
- classification: identifying to which of a set of categories a new observation belongs, on the basis of a training set of data
- tagging problem (structure prediction):
- input: observation sequence; output: tag sequence or status sequence
- e.g., part of speech tagging
- regression: estimate the relationship between a dependent variable and one or more independent variables
- linear regression and multiple linear regression
- unsupervised learning
- semi-supervised learning
- reinforcement learning
- loss function (cost function): a function that maps an event or the values of one or more variables onto a real number intuitively representing some "loss" associated with the event
- expected loss (risk): $R_{exp}(f) = E_P\big[L(Y, f(X))\big] = \int L(y, f(x))\, P(x, y)\, \mathrm{d}x\, \mathrm{d}y$
- smaller the loss, better the model --> goal: select the model with the smallest expected loss
- most commonly used methods:
- empirical risk minimization (ERM)
- structural risk minimization (SRM)
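- the two objectives in standard form (N training samples, loss L, hypothesis space $\mathcal{F}$, complexity penalty $J(f)$, coefficient $\lambda \ge 0$):
- ERM: $\min_{f \in \mathcal{F}} \; R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$
- SRM: $\min_{f \in \mathcal{F}} \; R_{srm}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$, where $\lambda J(f)$ penalizes model complexity (this is the regularization term discussed below)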
- training error & test error
- note: L --> loss function
- error rate (when L is the 0-1 loss function): $e_{test} = \frac{1}{N'} \sum_{i=1}^{N'} I\big(y_i \neq \hat{f}(x_i)\big)$
- note: I --> indicator function: if $y_i = \hat{f}(x_i)$ then I = 0, o.w. I = 1
- accuracy: $r_{test} = \frac{1}{N'} \sum_{i=1}^{N'} I\big(y_i = \hat{f}(x_i)\big) = 1 - e_{test}$
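- a tiny sketch of these two quantities in code (array names are illustrative):

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Average 0-1 loss: fraction of misclassified points."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def accuracy(y_true, y_pred):
    """Fraction correctly classified; equals 1 - error_rate."""
    return 1.0 - error_rate(y_true, y_pred)

print(error_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))    # 0.75
```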
- regularization: introduce additional information (a regularization term) into the loss function --> in order to prevent overfitting
- Occam's razor: a problem-solving principle --> among competing hypotheses, the one with the fewest assumptions should be selected
- cross validation: a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set
- measure of how accurately an algorithm is able to predict outcome values for previously unseen data
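- a minimal sketch of k-fold cross validation (the `model_factory` interface with `fit`/`predict` is an assumption for illustration, not an API from either book):

```python
import numpy as np

def k_fold_cv(model_factory, X, y, k=5, seed=0):
    """Average held-out error rate over k folds.
    model_factory: callable returning a fresh model with fit(X, y) and predict(X)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))
    return float(np.mean(errors))
```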
- the performance of a learning algorithm is often assessed by plotting the generalization error over the learning process (learning curves) --> smaller the generalization error, better the model
- generalization error bound: the generalization error is less than some error bound --> smaller the generalization error bound, better the model
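- for a finite hypothesis space with d functions and the 0-1 loss, the bound in 统计学习方法 takes the following form (restated here from memory, treat as a sketch): with probability at least $1 - \delta$,
- $R(f) \le \hat{R}(f) + \varepsilon(d, N, \delta)$, with $\varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\big(\log d + \log\frac{1}{\delta}\big)}$, where $R(f)$ is the generalization risk and $\hat{R}(f)$ the training (empirical) risk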
- generative model: learns the joint probability distribution p(x,y), then derives p(y|x) from it
- e.g., Naive Bayes, Hidden Markov Model
- converges faster as the amount of training data grows
- discriminative model: learns the conditional probability distribution p(y|x) directly --> the probability of y given x
- e.g., K Nearest Neighbor, Perceptron, Decision Tree, Logistic Regression, Maximum Entropy, Support Vector Machine, Boosting Method, Conditional Random Field
- often higher accuracy, since it models the prediction target p(y|x) directly
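- the link between the two (a one-line reminder, standard Bayes' rule):
- $P(Y \mid X) = \dfrac{P(X, Y)}{P(X)} = \dfrac{P(X \mid Y)\,P(Y)}{\sum_{y'} P(X \mid Y = y')\,P(Y = y')}$, so a generative model of $P(X, Y)$ can always recover the conditional $P(Y \mid X)$ used for prediction, while a discriminative model estimates $P(Y \mid X)$ (or a decision function) directly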
- linearly separable data set: there exists at least one hyperplane (a line in 2-D) in the feature space with all points from one class on one side and all points from the other class on the other side

Perceptron
- supervised learning; binary classifier; only works for linearly separable data; allows for online learning; error driven
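- a minimal sketch of the primal perceptron update rule (learning rate, variable names, and the toy points are illustrative):

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, max_epochs=100):
    """Primal-form perceptron for labels y in {-1, +1}.
    Error driven: only misclassified points trigger an update.
    Converges only if the data are linearly separable."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified
                w += lr * yi * xi               # w <- w + eta * y * x
                b += lr * yi                    # b <- b + eta * y
                errors += 1
        if errors == 0:                         # perfect separation reached
            break
    return w, b

# toy usage on a separable set
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(perceptron_train(X, y))
```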
K-Nearest Neighbor
- a non-parametric method; lazy learning; non-linear classifier
- input: a feature vector (a point in the feature space); output: a class label
- feature space --> output space: majority voting rule, $y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$
- note: I --> indicator function: if $y_i = c_j$ then I = 1, o.w. I = 0
- K: the number of nearest neighbors used in the vote
- small k (decrease approximation error; increase estimation error; sensitive to noise in the nearest neighbors) --> complicated model --> prone to over-fitting
- large k (decrease estimation error; increase approximation error) --> simple model --> prone to under-fitting
- k-dimensional tree (kd tree): a space-partitioning data structure for organizing points in a k-dimensional space
- recursively partitions the feature space with axis-aligned splits to speed up nearest-neighbor search
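- a brute-force sketch of the classifier (no kd tree; Euclidean distance and the names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # L2 distances to every training point
    neighbors = np.argsort(dists)[:k]             # indices of the k nearest points
    votes = Counter(y_train[neighbors].tolist())  # class counts among the neighbors
    return votes.most_common(1)[0][0]             # majority voting rule

# toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1
```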
Naive Bayes
- probabilistic classifier; assumes strong (naive) conditional independence between the features given the class
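- the resulting classification rule (standard form; $x^{(j)}$ is the j-th feature of the input):
- $y = \arg\max_{c_k} \; P(Y = c_k) \prod_{j} P\big(X^{(j)} = x^{(j)} \mid Y = c_k\big)$, where the class prior and the per-feature conditionals are estimated from the training data (e.g., by maximum likelihood, possibly with Laplace smoothing)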
Decision Tree
- interpretable; fast to learn and to apply
- decision tree: uses a decision tree as a predictive model: maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves)
- node:
- internal node: test on an attribute
- leaf node: class
- directed edge: the outcome of the test
- paths from root to leaf: classification rules
- entropy: measures the uncertainty of a random variable: larger the entropy, greater the uncertainty
- used (through information gain, below) to decide which feature to split on at each step in building the tree
- the entropy of the random variable X: $H(X) = -\sum_{i=1}^{n} p_i \log p_i$, where $p_i = P(X = x_i)$
- entropy is independent of the values of X; it only depends on the distribution of X
- conditional entropy: $H(Y \mid X) = \sum_{i=1}^{n} p_i\, H(Y \mid X = x_i)$, where $p_i = P(X = x_i)$
- uncertainty of random variable Y given random variable X: larger the conditional entropy, greater the uncertainty
- information gain (mutual information): $g(D, A) = H(D) - H(D \mid A)$
- H(D): the uncertainty in classifying data set D
- H(D|A): the uncertainty in classifying data set D given feature A
- g(D, A): how much feature A reduces the uncertainty in classifying data set D
- information gain ratio: $g_R(D, A) = \dfrac{g(D, A)}{H_A(D)}$, where $H_A(D)$ is the entropy of D with respect to the values of feature A
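- a small sketch of these quantities for one candidate split feature (the arrays and the choice of log base 2 are illustrative):

```python
import numpy as np

def entropy(labels):
    """H(D): empirical entropy of the class labels (log base 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(feature, labels):
    """H(D|A): weighted entropy of the labels after splitting on feature A."""
    h = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        h += len(subset) / len(labels) * entropy(subset)
    return h

def information_gain(feature, labels):
    """g(D, A) = H(D) - H(D|A): reduction in classification uncertainty due to A."""
    return entropy(labels) - conditional_entropy(feature, labels)

# toy usage: one binary feature, binary labels
feature = np.array([0, 0, 1, 1, 1])
labels = np.array([0, 0, 1, 1, 0])
print(information_gain(feature, labels))  # about 0.42 bits
```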
Latent Dirichlet allocation
Monte Carlo