MACHINE LEARNING
Notes
Supervised Learning: applications in which the training data comprises input
vectors along with the corresponding target vectors.
Classification: applications in which the aim is to assign each input vector to
one of a finite number of discrete categories.
For ex: classification of digits in a digit-recognition problem.
Regression: similar to the classification problem, but the desired output
consists of one or more continuous variables.
For ex: predicting the 'yield' of a chemical manufacturing plant,
where the input vector consists of the temperature, pressure,
conc. of reactants etc.
In other pattern recognition problems, the training data consists
of a set of input vectors X without any corresponding target vectors.
This is called an Unsupervised Learning problem.
Clustering: unsupervised learning problems where the goal is to discover
groups of similar examples within the data.
Density Estimation: unsupervised learning problems where the goal is to
determine the distribution of data within the input space.
Reinforcement Learning: concerned with the problem of finding suitable
actions to take in a given situation in order to maximize a reward.
--------------------------------------------------------------
Here the learning algorithm is not given examples of optimal
outputs, in contrast to Supervised Learning, but must instead
discover them by Trial & Error.
Typically there is a sequence of states & actions in which the
learning algorithm is interacting with its environment.
In many cases, the current action not only affects the immediate
reward but also has an impact on the rewards at all subsequent time
steps.
For eg: Using appropriate reinforcement learning techniques, a
Neural Network can learn to play the game of chess to a very
high standard.
A general feature of reinforcement learning is a trade-off b/w:
1. EXPLORATION: the system tries out new kinds of actions to find out how
effective they are
2. EXPLOITATION: the system uses actions that are known to yield a high reward
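A minimal sketch of this trade-off (not from the notes), using an epsilon-greedy policy on a toy multi-armed bandit; all names and numbers below are illustrative choices:

```python
import random

true_means = [0.2, 0.5, 0.8]          # hidden expected reward of each action
estimates = [0.0] * len(true_means)    # running estimate of each action's value
counts = [0] * len(true_means)
epsilon = 0.1                          # fraction of steps spent exploring

for step in range(10_000):
    if random.random() < epsilon:
        action = random.randrange(len(true_means))   # EXPLORATION: try any action
    else:
        # EXPLOITATION: pick the action currently believed to be best
        action = max(range(len(true_means)), key=lambda a: estimates[a])
    reward = random.gauss(true_means[action], 1.0)   # noisy reward from environment
    counts[action] += 1
    # incremental mean update of the action-value estimate
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # should approach true_means, with the best arm chosen most often
```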
Functions which are linear in unknown parameters are known as
Linear Models.
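As a quick illustration (with synthetic data): a polynomial is nonlinear in the input x but linear in the unknown coefficients w, so it is a linear model in this sense and can be fit in closed form by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, size=50)

Phi = np.column_stack([np.ones_like(x), x, x**2])   # design matrix of basis functions
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # solve min_w ||Phi w - y||^2
print(w)   # approximately [1.0, 2.0, -3.0]
```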
For a given model complexity, the overfitting problem becomes less
severe as the size of data increases.
The larger the data-set, the more complex (in other words, more flexible)
the model that we can afford to fit to the data.
One rough heuristic for choosing the size of the data-set is:
The number of data points should be no less than some multiple
of the no. of adaptive parameters in the model.
However, no. of parameters is not the only measure of model complexity.
Overfitting is a general problem of Maximum Likelihood Estimation (MLE)
and it can be avoided by adopting a Bayesian approach.
- Regularization
- Ridge Regression
- Weight Decay (in Neural Networks)
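A hedged sketch of ridge regression, one of the regularizers listed above: adding a penalty $\lambda\|w\|^2$ to the squared error shrinks the weights and tames overfitting, with the closed form $w = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T y$. The sin(2πx) data-set and the value of λ are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=10)

degree, lam = 9, 1e-3                                # high-degree fit, small penalty
Phi = np.vander(x, degree + 1, increasing=True)      # polynomial design matrix
A = lam * np.eye(degree + 1) + Phi.T @ Phi
w_ridge = np.linalg.solve(A, Phi.T @ y)              # regularized solution

w_mle, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # unregularized MLE fit
print(np.linalg.norm(w_ridge), np.linalg.norm(w_mle))  # ridge weights are much smaller
```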
The average value of some function f(x) under a probability distribution p(x)
is called the Expectation of f(x) and is denoted by E[f].
$$E[f] = \sum_x{ p(x)f(x) }$$
(The average is weighted by the relative probabilities of the different values
of x.)
$E_x[ f(x,y) ]$ denotes the average of a function f(x,y) w.r.t. the
distribution of x; so $E_x[ f(x,y) ]$ will be a function of y. We can also define
the Conditional Expectation ($E[f|y]$) w.r.t. the conditional distribution ($p(x|y)$) in a similar manner.
$$E[f|y] = \sum_x{ p(x|y)f(x) }$$
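A small numeric sketch of both formulas, using a made-up joint distribution p(x, y) over x in {0, 1, 2} and y in {0, 1}:

```python
import numpy as np

p = np.array([[0.10, 0.20],
              [0.30, 0.15],
              [0.05, 0.20]])        # joint p[x, y], sums to 1
f = np.array([1.0, 4.0, 9.0])       # f(x) = (x + 1)^2, say

p_x = p.sum(axis=1)                 # marginal p(x)
E_f = np.sum(p_x * f)               # E[f] = sum_x p(x) f(x)

p_x_given_y0 = p[:, 0] / p[:, 0].sum()      # conditional p(x | y=0)
E_f_given_y0 = np.sum(p_x_given_y0 * f)     # E[f | y=0] = sum_x p(x|y) f(x)

print(E_f, E_f_given_y0)
```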
Variance var[*]:
It provides a measure of how much variability there is in
f(x) around its mean value $E[f(x)]$.
$$var[f(x)] = E[( f(x) - E[f(x)] )^2 ]$$
$$var[X] = \sigma^2 = E[X^2] - (E[X])^2$$
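A quick numeric check of the identity $var[X] = E[X^2] - (E[X])^2$ on random samples; the sampling distribution is arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=100_000)

lhs = np.mean((x - x.mean())**2)        # E[(x - E[x])^2]
rhs = np.mean(x**2) - np.mean(x)**2     # E[x^2] - (E[x])^2
print(lhs, rhs)                          # the two agree (up to sampling noise)
```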
Covariance cov[*]:
For two random variables x and y, the covariance cov[x,y] is a measure of the
extent to which x and y vary together.
$$cov[x,y] = E[xy] - E[x]E[y]$$
If x and y are mutually independent then their covariance is zero. In the case
of two vectors of random variables $X$ and $Y$, the covariance is a Matrix.
The covariance of the components of vector $X$ with each other is:
$$cov[X] = cov[X,X]$$
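An illustrative check of these covariance facts with synthetic samples, using numpy's cov for the matrix case:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # y co-varies with x
z = rng.normal(size=100_000)             # z is independent of x

print(np.mean(x * y) - np.mean(x) * np.mean(y))   # ~0.5 (x and y vary together)
print(np.mean(x * z) - np.mean(x) * np.mean(z))   # ~0.0 (independent => cov zero)

X = np.stack([x, y, z])                  # vector of random variables
print(np.cov(X))                         # 3x3 covariance matrix cov[X, X]
```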
In the ML literature, the negative log of the likelihood function is called
the Error function:
$$Error\ function: -\log(p(D|w))$$
In MLE, we try to maximize the likelihood function $p(D|w)$, and because
$\log()$ is a monotonically increasing function, maximizing the likelihood is
equivalent to minimizing the error (remember: the error is the negative of the
log-likelihood, so it is a monotonically decreasing function of the likelihood).
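A sketch of this equivalence for i.i.d. Gaussian data: scanning the error function over μ shows that its minimizer is the maximum-likelihood solution (the sample mean). All data and grid settings below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(loc=3.0, scale=1.0, size=500)

def neg_log_likelihood(mu, sigma, x):
    # -log p(D | mu, sigma) = sum over points of -log N(x_n | mu, sigma^2)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu)**2 / (2 * sigma**2))

mus = np.linspace(2.0, 4.0, 201)
errors = [neg_log_likelihood(mu, 1.0, data) for mu in mus]
print(mus[np.argmin(errors)], data.mean())   # minimizer of the error ~ sample mean
```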
One common criticism of the Bayesian viewpoint is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs.
Bayesian approach based on poor priors can give poor results with high confidence.
The Gaussian Distribution
For a single real-valued variable x, the Gaussian Distribution is defined as:
$$ \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\ \exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big) $$
where $\mu$ is the Mean, $\sigma^2$ the Variance, $\sigma$ the Standard Deviation, and $\beta = 1/\sigma^2$ the Precision.
The maximum of a distribution is called its Mode. For a Gaussian Distribution,
the Mode coincides with the Mean ($\mu$).
Expectation or Mean : $$ E[x] = \int\limits_{-\infty}^{+\infty}\mathcal{N}(x|\mu, \sigma^2)\ x\ dx = \mu $$ and $$ E[x^2] = \int\limits_{-\infty}^{+\infty}\mathcal{N}(x|\mu, \sigma^2)\ x^2\ dx = (\mu^2 + \sigma^2) $$
So the variance var[x] is:
$$ var[x] = E[x^2] - (E[x])^2 = (\mu^2 + \sigma^2) - \mu^2 = \sigma^2 $$
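A numeric sanity check of these moment results with samples from a Gaussian (the values of μ and σ are arbitrary):

```python
import numpy as np

mu, sigma = 1.5, 0.7
rng = np.random.default_rng(5)
x = rng.normal(mu, sigma, size=1_000_000)

print(x.mean())                          # ~ mu
print(np.mean(x**2))                     # ~ mu^2 + sigma^2
print(np.mean(x**2) - x.mean()**2)       # ~ sigma^2
```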
The Multivariate Gaussian Distribution
The Gaussian Distribution defined over a D-dimensional vector $\bold{x}$:
$$ \mathcal{N}(\bold{x}|\bold{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\ \exp\Big(-\frac{1}{2}(\bold{x}-\bold{\mu})^T\Sigma^{-1}(\bold{x}-\bold{\mu})\Big) $$
where,
$\bold{x}$ : D-dimensional vector of continuous variables
$\bold{\mu}$ : Mean, a D-dimensional vector
$\Sigma$ : Covariance, a D x D Matrix
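A minimal sketch that evaluates this density directly from μ and Σ; the example values are made up:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    D = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(multivariate_gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```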
i.i.d. = independent and identically distributed: data points that are drawn
independently from the same distribution. The Joint Probability of independent
events is given by the product of the marginal probabilities of each event
separately.
So, suppose we have a data-set $\bold{X} = (x_1, x_2, ....., x_N)^T$ of a
single-valued variable x, drawn i.i.d. from a Gaussian. Then the probability
of the data-set, i.e. the likelihood function of the Gaussian, is given by:
$$ p(\bold{X}|\mu, \sigma^2) = \prod_{n=1}^{N}\mathcal{N}(x_n|\mu, \sigma^2) $$
Note: $\bold{X}$ is not a Vector like $\bold{x}$; it is a collection of N
observations of the single-valued variable x.
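A sketch of this factorization in log space, where the product becomes a sum and the MLE solutions are the sample mean and the (biased) sample variance; the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # N observations of scalar x

def log_likelihood(mu, sigma2, X):
    # log p(X | mu, sigma^2) = sum_n log N(x_n | mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (X - mu)**2 / (2 * sigma2))

mu_mle = X.mean()                        # maximizes the log-likelihood in mu
sigma2_mle = np.mean((X - mu_mle)**2)    # biased MLE variance
print(mu_mle, sigma2_mle, log_likelihood(mu_mle, sigma2_mle, X))
```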
- Binomial and Multinomial Distributions for Discrete random variables
- Gaussian Distribution for Continuous random variables
- Parametric Distributions: distributions that are governed by a small no. of adaptive parameters (like $\mu$ and $\sigma$ for Gaussian Distributions). One limitation of the parametric approach is that it assumes a specific functional form of the distribution, which may turn out to be inappropriate for a particular application.
- Density Estimation: modelling the probability distribution $p(\bold{x})$ of a random variable $\bold{x}$ given a finite set ${\bold{x_1, x_2, x_3........x_N}}$ of observations.
- The conjugate prior for the parameters of a Multinomial Distribution is called the Dirichlet Distribution.
- The conjugate prior for the Gaussian is another Gaussian.
- These distributions are all members of the Exponential family of Distributions.
- Non-parametric Density Estimation: here, the form of the distribution typically depends on the size of the data-set. Such models still have parameters, but these control the model complexity rather than the form of the distribution (a sketch follows this list).
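A hedged sketch of the non-parametric idea using a Gaussian kernel density estimate: the bandwidth h controls model complexity, while the estimate itself is built from the data points. The data and the value of h are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

def kde(x, data, h=0.3):
    # p(x) ~ (1/N) sum_n N(x | x_n, h^2): one kernel per data point,
    # so the "model" grows with the data-set; h sets the complexity
    return np.mean(np.exp(-(x - data)**2 / (2 * h**2))
                   / np.sqrt(2 * np.pi * h**2))

print(kde(-2.0, data), kde(1.0, data), kde(5.0, data))  # high, high, near zero
```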
.......................................................................................................................................
It is a general property of Bayesian Learning that, as we observe more and more data, the uncertainty represented by the posterior distribution steadily decreases.
For eg: Consider a general Bayesian inference problem for a parameter $\theta$, for which we have observed a data-set $D$, with joint distribution $p(\theta, D)$.
The following result says that the posterior mean of $\theta$, averaged over the distribution generating the data, is equal to the prior mean of $\theta$:
$$ E_{\theta}[\theta] = E_D[E_{\theta}[\theta|D]] $$
Proof:
$$ E_{\theta}[\theta] = \int_{\theta}\theta\ p(\theta)\ d\theta $$
$$ E_D[E_{\theta}[\theta|D]] = \int_D\Bigg\{\int_{\theta}\theta\ p(\theta|D)\ d\theta \Bigg\}\ p(D)\ dD = \int_{\theta}\theta \int_D p(\theta, D)\ dD\ d\theta = \int_{\theta}\theta\ p(\theta)\ d\theta = E_{\theta}[\theta] $$
(using $p(\theta|D)\,p(D) = p(\theta, D)$ and then marginalizing over D)
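A simulation sketch of this result under a Beta-Bernoulli model (chosen here purely for conjugacy; the hyper-parameters are arbitrary): averaging the posterior mean over many generated data-sets recovers the prior mean:

```python
import numpy as np

rng = np.random.default_rng(8)
a, b, N = 2.0, 5.0, 20          # Beta(a, b) prior, N coin flips per data-set

posterior_means = []
for _ in range(50_000):
    theta = rng.beta(a, b)                    # theta ~ p(theta)
    heads = rng.binomial(N, theta)            # D ~ p(D | theta)
    posterior_means.append((a + heads) / (a + b + N))   # E[theta | D] by conjugacy

print(np.mean(posterior_means), a / (a + b))  # both ~ prior mean = a/(a+b)
```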
Simplex: A bounded linear manifold.
Logistic Regression
- Sometimes, in the case of small training data, learning
$w_0, w_i$ directly from the training data produces more promising results compared to learning $\mu, \sigma$
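A minimal sketch of fitting $w_0, w_i$ directly by gradient descent on the cross-entropy error, with no class-conditional $\mu, \sigma$ to estimate (synthetic two-class data; the learning rate and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])   # class labels

Phi = np.column_stack([np.ones(len(X)), X])       # prepend bias feature for w0
w = np.zeros(Phi.shape[1])
for _ in range(2000):
    y = 1 / (1 + np.exp(-Phi @ w))                # sigmoid predictions p(C=1|x)
    w -= 0.1 * Phi.T @ (y - t) / len(t)           # gradient of cross-entropy loss
print(w)                                          # learned [w0, w1, w2]
```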