This is a series of scripts written while following the Machine Learning from Scratch tutorials on the Python Engineer YouTube channel, available at https://www.youtube.com/playlist?list=PLqnslRFeH2Upcrywf-u2etjdxxkL8nl7E. In these tutorials, the instructor teaches how to implement popular machine learning algorithms using only Python and NumPy, without any additional libraries.

What I learnt:

K Nearest Neighbours:

  • A sample is classified by a majority vote of its nearest neighbours, i.e. if k=3 and 2 of the 3 nearest points on the graph belong to the same class, then the sample will be labelled as part of that class.
  • For this to work, labelled training samples must be provided (i.e. points from multiple different classes plotted on the graph)
  • To calculate distances, we use the Euclidean distance (a rough sketch of the full algorithm is given after the demo below)

KNN Demo from tutorial video
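As a minimal sketch of the idea described above (assuming NumPy; the function names are my own, not necessarily the tutorial's):

```python
import numpy as np
from collections import Counter

def euclidean_distance(a, b):
    # Straight-line distance between two feature vectors
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict(X_train, y_train, x, k=3):
    # Distance from the query point x to every training sample
    distances = [euclidean_distance(x, x_train) for x_train in X_train]
    # Indices of the k closest training samples
    k_indices = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbours
    k_labels = [y_train[i] for i in k_indices]
    return Counter(k_labels).most_common(1)[0][0]
```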

Linear Regression:

  • ŷ = wx + b (w = weights, b = bias)
  • To find the weights and the bias, a cost function is used:

MSE
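Written out (reconstructed from the definitions above, with N training samples):

```latex
MSE = J(w, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (w x_i + b) \right)^2
```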

  • Since this measures the error, we want to minimize it, i.e. find the minimum of this function. To do this we need to find the derivative:

MSE Derivative
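In symbols, the standard derivation gives:

```latex
\frac{\partial J}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} -2 x_i \left( y_i - (w x_i + b) \right), \qquad
\frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} -2 \left( y_i - (w x_i + b) \right)
```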

  • This gives the gradient of the cost function with respect to w and with respect to b
  • Now we use gradient descent, which is an iterative technique to find the minimum point:

Gradient Descent

  • "So we have some initialization of the weights and the bias and then we want to go into the direction of the steepest descent and the steepest descent is also the gradient so we want to go into the direction of the into the negative direction of the gradient and we do this iteratively until we finally reached the minimum"
  • To do this iteratively, we need some update rules:

Linear Regression - Update rules and derivatives
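A minimal sketch of the training loop with these update rules (assuming NumPy; the code and names are mine, not copied from the tutorial):

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.01, n_iters=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # initialise the weights
    b = 0.0                   # initialise the bias
    for _ in range(n_iters):
        y_pred = X.dot(w) + b
        # Gradients of the MSE with respect to w and b
        dw = (2 / n_samples) * X.T.dot(y_pred - y)
        db = (2 / n_samples) * np.sum(y_pred - y)
        # Update rules: step in the negative direction of the gradient
        w -= lr * dw
        b -= lr * db
    return w, b
```

The step size `lr` here is the learning rate discussed in the next bullet.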

  • The learning rate is a very important parameter: a small learning rate may be slower but more accurate, while a large learning rate may be faster but can overshoot and never find the minimum point.

Comparison of learning rates

Logistic Regression:

  • "In statistics, the logistic model is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc"
  • In linear regression, we use the formula f(w,b) = wx + b, which outputs continuous values. To turn this into a probability, we use the sigmoid function:

Sigmoid Function
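Written out, the sigmoid squashes any real number into the range (0, 1):

```latex
s(x) = \frac{1}{1 + e^{-x}}
```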

  • Applying the sigmoid function to the linear model f(w,b) = wx + b (w = weights, b = bias) gives the following approximation:

Logistic Regression Approximations
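In symbols, the predicted probability becomes:

```latex
\hat{y} = s(wx + b) = \frac{1}{1 + e^{-(wx + b)}}
```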

  • This will output a probability between 0 and 1
  • This is the cost function we use:

Logistic Regression + Cost Function
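The image shows the cross-entropy cost; in its standard form for N training samples it reads:

```latex
J(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
```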

  • To optimize this cost function, we again use gradient descent. These are the update rules and derivatives for the logistic regression algorithm:

Logistic Regression - Update rules + Derivatives
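A minimal sketch of training and prediction (assuming NumPy; the names are mine, not necessarily the tutorial's):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def fit_logistic_regression(X, y, lr=0.01, n_iters=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        y_pred = sigmoid(X.dot(w) + b)
        # The gradients take the same form as in linear regression
        dw = (1 / n_samples) * X.T.dot(y_pred - y)
        db = (1 / n_samples) * np.sum(y_pred - y)
        w -= lr * dw
        b -= lr * db
    return w, b

def predict(X, w, b, threshold=0.5):
    # Probabilities at or above the threshold are labelled class 1
    return (sigmoid(X.dot(w) + b) >= threshold).astype(int)
```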

Naive Bayes Classifier:

  • Based on Bayes' theorem, which states that if we have two events A and B, then the probability of event A given that B has already happened is equal to the probability of B given A, times the probability of A, divided by the probability of B:

Bayes Theorem
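In symbols:

```latex
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
```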

  • In our case, with a feature vector X and a class label y, we use it like so:

How we will use the Bayes theorem
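Written out:

```latex
P(y \mid X) = \frac{P(X \mid y) \, P(y)}{P(X)}
```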

  • We then use the chain rule, together with the naive independence assumption explained below, to get the following:

The Bayes theorem after applying the chain rule
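Assuming the features x_1, ..., x_n are mutually independent, the class conditional probability factorises into a product:

```latex
P(y \mid X) = \frac{P(x_1 \mid y) \cdot P(x_2 \mid y) \cdots P(x_n \mid y) \cdot P(y)}{P(X)}
```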

  • Terminology:

    • P(y|X) is called the posterior probability
    • P(X|y) is called the class conditional probability
    • P(y) is called the prior probability of y
    • P(X) is called the prior probability of X
  • It is called Naive Bayes because it assumes that all features (the factors contributing to the overall probability) are mutually independent, which is unlikely in the real world

  • "For example if you want to predict the probability that a person is going out for a run given the feature that the sun is shining and also given the feature that the person is healthy, then both of these features might be independent but both contribute to this probability that the person goes out. In real life a lot of features are not mutually independent but this assumption works fine for a lot of problems"

  • We then select the class with the highest probability, using the first formula given below. Since P(X) does not depend on y, we can ignore it (second formula). Finally, we take logarithms to get the third formula: because all the probabilities are between 0 and 1, their product becomes a very small number, which could lead to numerical underflow.

How to select the class with the highest probability
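Reconstructed from the description above, the three formulas are:

```latex
y = \operatorname*{argmax}_{y} P(y \mid X) = \operatorname*{argmax}_{y} \frac{P(x_1 \mid y) \cdots P(x_n \mid y) \, P(y)}{P(X)}
y = \operatorname*{argmax}_{y} P(x_1 \mid y) \cdot P(x_2 \mid y) \cdots P(x_n \mid y) \cdot P(y)
y = \operatorname*{argmax}_{y} \log(P(x_1 \mid y)) + \cdots + \log(P(x_n \mid y)) + \log(P(y))
```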

  • In the end, the prior P(y) is simply the frequency of class y in the training data
  • The class conditional probability is calculated as follows:

Class Conditional Probability Calculation
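The tutorial models this with a Gaussian (normal) distribution per class (Gaussian Naive Bayes), using the mean μ_y and variance σ_y² of each feature within class y:

```latex
P(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\!\left( -\frac{(x_i - \mu_y)^2}{2 \sigma_y^2} \right)
```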

Perceptron:

  • The perceptron can be seen as one single unit of an artificial neural network
  • It is a simplified model of a biological neuron and it simulates the behavior of only one cell
  • Inputs (weighted and summed) -> activation function -> output
  • In this code, we use the unit step function as our activation function (a rough sketch follows below).
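A minimal sketch of a single perceptron with the unit step activation (assuming NumPy and labels in {0, 1}; the names are mine, not necessarily the tutorial's):

```python
import numpy as np

def unit_step(x):
    # Activation function: 1 once the weighted sum is non-negative, else 0
    return np.where(x >= 0, 1, 0)

def fit_perceptron(X, y, lr=0.01, n_iters=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        for xi, target in zip(X, y):
            # Weighted sum of the inputs, then the activation function
            prediction = unit_step(np.dot(xi, w) + b)
            # Perceptron update rule: nudge w and b towards the target
            update = lr * (target - prediction)
            w += update * xi
            b += update
    return w, b

def predict_perceptron(X, w, b):
    return unit_step(X.dot(w) + b)
```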