This repository contains a code sample for an SMS Spam Classifier and a brief overview of ML algorithms.
Data Set Link: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
Machine Learning
Machine Learning (or ML) is an area of Artificial Intelligence (AI) made up of statistical techniques for problem solving. These techniques can be applied to a wide variety of problems, including but not limited to vision-based research, fraud detection, price prediction, and even NLP.
Natural Language Processing
Natural Language Processing (or NLP) is an area at the confluence of Artificial Intelligence and Linguistics. NLP covers many tasks, such as Text Generation, Text Classification, Machine Translation, Speech Recognition, and Sentiment Analysis.
Deep Learning
Deep Learning (which includes Recurrent Neural Networks, Convolutional Neural Networks and others) is a type of Machine Learning approach. It is an extension of Neural Networks. It can be used for vision-based classification as well as for NLP.
Flow Chart of everything ML
Naive Bayes Classifier
Bayes Theorem: P(A|B) = P(B|A) P(A) / P(B)
- P(A|B): how often A happens given that B happens
- P(B|A): how often B happens given that A happens
The Naive Bayes classifier calculates a probability for every feature and combines them to score each class. It assumes that the features (in this case, the words in a message) are independent of each other given the class. Hence the word naive.
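As a rough illustration of Bayes' theorem together with the independence assumption, the sketch below scores a message as spam from per-word probabilities. The function name `spam_posterior` and all the numbers are made up for illustration; they are not taken from the SMS Spam Collection.

```python
import math


def spam_posterior(prior_spam, word_probs_spam, word_probs_ham):
    """Return P(spam | words), assuming words are independent given the class."""
    # Naive assumption: P(words | class) is the product of the per-word probabilities.
    spam_score = prior_spam * math.prod(word_probs_spam)
    ham_score = (1.0 - prior_spam) * math.prod(word_probs_ham)
    # Bayes' theorem: divide by P(words), the total over both classes.
    return spam_score / (spam_score + ham_score)


# Hypothetical values: P(spam) = 0.2, P("free" | spam) = 0.6, P("free" | ham) = 0.05.
one_word = spam_posterior(0.2, [0.6], [0.05])

# Add a second word with hypothetical P("win" | spam) = 0.4 and P("win" | ham) = 0.02.
two_words = spam_posterior(0.2, [0.6, 0.4], [0.05, 0.02])

print(f"P(spam | 'free') = {one_word:.2f}")
print(f"P(spam | 'free', 'win') = {two_words:.2f}")
```

With these made-up numbers the single word gives a posterior of 0.75, and the second spam-leaning word pushes it higher.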
Sklearn's Naive Bayes module provides, among others, three alternatives for model training:
- Gaussian: assumes that features follow a normal distribution.
- Multinomial: used for discrete counts (e.g. the number of times outcome x_i is observed over n trials).
- Bernoulli: useful if your feature vectors are binary (i.e. zeros and ones), such as text classification with a ‘bag of words’ model where 1s are words that occur in the document and 0s are words that do not.
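The following sketch shows how the Multinomial and Bernoulli variants might be applied to bag-of-words features. The six training messages are invented toy data, not rows from the SMS Spam Collection, and the pipeline (a plain `CountVectorizer` with default settings) is an assumption rather than this repository's exact code. The Gaussian variant is left out here because it expects dense, continuous features rather than sparse word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Invented toy messages, for illustration only.
train_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "call now to win cash",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after lunch",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag of words: each column counts how often a vocabulary word occurs in a message.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_messages)

new_messages = ["claim your free cash prize", "lunch at home today"]
X_new = vectorizer.transform(new_messages)

# MultinomialNB uses the word counts; BernoulliNB only uses presence or absence.
multinomial_pred = MultinomialNB().fit(X_train, train_labels).predict(X_new)
bernoulli_pred = BernoulliNB().fit(X_train, train_labels).predict(X_new)

print("Multinomial:", multinomial_pred.tolist())
print("Bernoulli:  ", bernoulli_pred.tolist())
```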
Support Vector Machines (Classifier)
Separation of classes
Given labelled training data, the algorithm outputs an optimal hyperplane that categorises new examples.
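A minimal sketch of this idea, assuming two-dimensional toy points (invented for illustration) and sklearn's `SVC` with a linear kernel: the fitted model exposes the hyperplane as a weight vector and an intercept, and new examples are categorised by which side of it they fall on.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data: class 0 near the origin, class 1 further out.
X = np.array([[1, 1], [1, 2], [2, 1], [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

# With a linear kernel the hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# New examples are labelled according to the side of the hyperplane they lie on.
new_points = np.array([[0.0, 0.0], [6.0, 6.0]])
predictions = clf.predict(new_points)

print("w =", w, "b =", b)
print("predictions:", predictions.tolist())
```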
Tuning Parameters in SVM Classifier
(By varying these parameters we can achieve a considerably non-linear classification line with more accuracy in a reasonable amount of time.)
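- Regularization Parameter
The regularization parameter (the C parameter in Python’s sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example. A large C favours a smaller margin that classifies more training points correctly, while a small C favours a larger margin even if some points end up misclassified.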
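To make the effect of C more concrete, here is a small sketch on invented toy data in which one class-0 point sits close to class 1. The helper `fit_and_measure` is hypothetical, and the margin width is taken as 2 / ||w||, which applies to the linear kernel only.

```python
import numpy as np
from sklearn.svm import SVC

# Invented toy data; the point (3.5, 3.5) belongs to class 0 but lies near class 1.
X = np.array(
    [[1, 1], [1, 2], [2, 1], [3.5, 3.5], [4, 4], [4, 5], [5, 4]], dtype=float
)
y = np.array([0, 0, 0, 0, 1, 1, 1])


def fit_and_measure(C):
    """Fit a linear SVC and return (margin width, training accuracy)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    return margin_width, clf.score(X, y)


# Small C: misclassification is cheap, so the optimizer prefers a wide margin.
soft_margin, soft_accuracy = fit_and_measure(0.01)
# Large C: misclassification is expensive, so the margin shrinks to fit every point.
hard_margin, hard_accuracy = fit_and_measure(1000.0)

print(f"C=0.01: margin={soft_margin:.2f}, training accuracy={soft_accuracy:.2f}")
print(f"C=1000: margin={hard_margin:.2f}, training accuracy={hard_accuracy:.2f}")
```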
- Gamma
The gamma parameter defines how far the influence of a single training example reaches.
- low gamma: points far away from the plausible separation line are also considered when calculating the separation line.
- high gamma: only the points close to the plausible separation line are considered in the calculation.
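One way to see what "how far the influence reaches" means is to evaluate the radial (RBF) similarity exp(-gamma * squared distance) between a training example and points at different distances. This is only a sketch: the points, the two gamma values, and the helper `rbf_influence` are chosen for illustration.

```python
import numpy as np


def rbf_influence(x, training_example, gamma):
    """RBF similarity exp(-gamma * squared distance) between two points."""
    return float(np.exp(-gamma * np.sum((x - training_example) ** 2)))


training_example = np.array([0.0, 0.0])
near_point = np.array([0.5, 0.0])
far_point = np.array([3.0, 0.0])

for gamma in (0.01, 10.0):
    near = rbf_influence(near_point, training_example, gamma)
    far = rbf_influence(far_point, training_example, gamma)
    # Low gamma: both values stay close to 1, so far points still matter.
    # High gamma: the value for the far point collapses towards 0.
    print(f"gamma={gamma}: near={near:.4f}, far={far:.4f}")
```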
- Margin
At its core, SVM tries to achieve a good margin.
A margin is the separation between the line and the closest points of each class.
1. Good Margin 2. Bad Margin
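The margin can be measured directly on a fitted linear model. The sketch below, on invented toy data, computes the perpendicular distance from the separating line to every training point and then takes the closest point of each class. A large C is used here as an assumption so that the fit approximates a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data.
X = np.array([[1, 1], [1, 2], [2, 1], [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1000.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Perpendicular distance of every training point to the line w . x + b = 0.
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The margin on each side is the distance to that class's closest point.
closest_class_0 = distances[y == 0].min()
closest_class_1 = distances[y == 1].min()

print(f"closest class-0 point: {closest_class_0:.3f}")
print(f"closest class-1 point: {closest_class_1:.3f}")
```

On this data the two distances come out roughly equal, which is what a well-placed separating line looks like: it sits midway between the closest points of the two classes.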
- Kernel
The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra.
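For the linear kernel, the prediction for a new input is calculated using the dot product between the input (x) and each support vector (xi) as follows:
f(x) = B(0) + sum(ai * (x,xi))
Here (x,xi) denotes the dot product, and the coefficients B(0) and ai are estimated from the training data by the learning algorithm.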
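The linear-kernel formula above can be checked against sklearn. In this sketch (toy data invented for illustration), `intercept_` plays the role of B(0) and `dual_coef_` holds the signed coefficients ai for each support vector; the helper `manual_decision` is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data.
X = np.array([[1, 1], [1, 2], [2, 1], [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)


def manual_decision(x):
    """Compute f(x) = B(0) + sum(ai * dot(x, xi)) over the support vectors."""
    dot_products = clf.support_vectors_ @ x
    return clf.intercept_[0] + np.sum(clf.dual_coef_[0] * dot_products)


x_new = np.array([3.0, 2.0])
manual_value = manual_decision(x_new)
library_value = clf.decision_function([x_new])[0]

# The two values should agree up to floating-point error.
print(manual_value, library_value)
```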
Other kernels, such as a Polynomial Kernel and a Radial Kernel, can be used to transform the input space into higher dimensions. This is called the Kernel Trick.
- Polynomial Kernel
K(x,xi) = (1 + sum(x * xi))^d
(the degree d must be specified by hand to the learning algorithm)
- Radial Kernel
K(x,xi) = exp(-gamma * sum((x - xi)^2))
(gamma must be specified to the learning algorithm; a good default value is gamma=0.1, and gamma is often chosen between 0 and 1)
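Both kernels are short enough to write out by hand. The sketch below assumes the usual forms (1 + sum(x * xi))^d for the polynomial kernel and exp(-gamma * sum((x - xi)^2)) for the radial kernel, and cross-checks them against sklearn's `polynomial_kernel` and `rbf_kernel`. The function names `poly_kernel` and `radial_kernel` and the input vectors are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel


def poly_kernel(x, xi, d):
    """Polynomial kernel (1 + dot(x, xi)) ** d."""
    return (1.0 + np.dot(x, xi)) ** d


def radial_kernel(x, xi, gamma):
    """Radial (RBF) kernel exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))


x = np.array([1.0, 2.0])
xi = np.array([3.0, 4.0])

poly_value = poly_kernel(x, xi, d=2)
rbf_value = radial_kernel(x, xi, gamma=0.1)

# sklearn's polynomial kernel is (gamma * dot + coef0) ** degree,
# so gamma=1 and coef0=1 reproduce the form used here.
poly_sklearn = polynomial_kernel([x], [xi], degree=2, gamma=1.0, coef0=1.0)[0, 0]
rbf_sklearn = rbf_kernel([x], [xi], gamma=0.1)[0, 0]

print(poly_value, poly_sklearn)
print(rbf_value, rbf_sklearn)
```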
SVC takes more training time than Naive Bayes, but its prediction is faster. However, which one performs best depends entirely on the scenario and the data set.