# Exercises for the course Machine Learning by Stanford

The exercises correspond to the course available through Coursera from September through November 2016.

These are my solutions to the programming assignments.

Table of Contents

- Week 4 - Neural Networks: Representation
- Week 3 - Logistic Regression
- Week 2 - Linear Regression with Multiple Variables

## Week 4 - Neural Networks: Representation

This week, we implemented one-vs-all logistic regression to recognize handwritten digits. We also used a neural network to predict the digits, given a set of pre-learned weights, by applying the feedforward propagation algorithm.

One vs all

In the one-vs-all method, we train a separate classifier for each of the 10 digit classes (labelled 1-10, where 10 stands for the digit 0). For this, we first had to write a vectorized implementation of logistic regression, which included vectorizing the cost function and the gradient.

Once we had these implementations, we used the fmincg function provided with the exercise, which performs better than Octave's fminunc when there is a large number of parameters.
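
A rough sketch of the one-vs-all training loop, assuming a lrCostFunction(theta, X, y, lambda) helper that returns the regularized cost and its gradient:

```matlab
% Sketch of the one-vs-all training loop: one binary classifier per class.
num_labels = 10;                     % digits 1-10 (10 stands for the digit 0)
lambda = 0.1;
[m, n] = size(X);
X = [ones(m, 1) X];                  % add the intercept / bias column
all_theta = zeros(num_labels, n + 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);

for c = 1:num_labels
  initial_theta = zeros(n + 1, 1);
  % fmincg minimizes the cost of classifying "class c vs. everything else".
  all_theta(c, :) = fmincg(@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
                           initial_theta, options)';
end
```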

Something really interesting was that we get a probability for each of the 10 classes, and we use the max function to get the index of the class with the maximum probability, which is the predicted digit.

```matlab
% Probability of each class for every example (one column per class).
predictions_for_each_k = sigmoid( X * all_theta' );
% The index of the most probable class along each row is the prediction.
[k_probability, k_value_predicted] = max( predictions_for_each_k, [], 2);
p = k_value_predicted;
```

The algorithm classified the training set with 94.9% accuracy.

Predicting digits

Neural Networks

For the neural networks part, we were given a pre-trained set of weights, Theta1 and Theta2, to use with the feedforward propagation algorithm.

We only had to complete the prediction code that calculates the output of the hidden layer and of the output layer.

Neural Network
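
A minimal sketch of that feedforward prediction, assuming Theta1 maps the input layer to the hidden layer and Theta2 maps the hidden layer to the output layer:

```matlab
% Feedforward prediction sketch (one hidden layer).
m = size(X, 1);

a1 = [ones(m, 1) X];            % input layer plus bias unit
a2 = sigmoid(a1 * Theta1');     % hidden layer activations
a2 = [ones(m, 1) a2];           % add bias unit to the hidden layer
a3 = sigmoid(a2 * Theta2');     % output layer: one probability per class

[~, p] = max(a3, [], 2);        % predicted class = most probable output unit
```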

Then, the provided code would randomly pick samples and predict them using our code, with 97.5% accuracy.

Predictions in real time

## Week 3 - Logistic Regression

This week, we solved two problems. The first was to predict whether a student would be admitted to a certain college, given the results of two admission exams and historical data on the acceptance of other students.

The second problem was to predict whether a microchip at a factory should be accepted or rejected based on two tests. In this example, we applied regularization.

Visualization

The first step to understand the problem was to visualize the data.

Visualization of sample data

Sigmoid Function

We then had to create a function that calculates the sigmoid for a vector or a matrix. Instead of looping over each element, I wrote a vectorized implementation using element-wise (dot) division.

The sigmoid function: g(z) = 1 / (1 + e^(-z))

Vectorized implementation

The vectorized implementation of the sigmoid function was one line of code 😎

```matlab
g = 1 ./ (1 + exp(-z));   % element-wise, so it works for vectors and matrices alike
```

Cost Function and gradient

To calculate the cost and gradient, I also used a vectorized implementation. The vectorized formulas for the hypothesis, the cost, and the gradient are:

Hypothesis for logistic regression (vectorized)

Cost function (vectorized)

Gradient (vectorized)
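
In math notation, these are (reconstructed to match the code below):

```latex
h = g(X\theta), \qquad
J(\theta) = \frac{1}{m}\bigl(-y^{T}\log(h) - (1-y)^{T}\log(1-h)\bigr), \qquad
\nabla_{\theta} J = \frac{1}{m}X^{T}(h - y)
```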

The code was also very short as a result:

```matlab
% Vectorized cost and gradient for (unregularized) logistic regression.
h = sigmoid(X*theta);                                % hypothesis for every example
J = (1/m) * (-y' * log(h) - (1-y)' * log(1 - h));    % cross-entropy cost

grad = (1/m) * X' * (h - y);                         % gradient for every theta_j
```

Prediction

Given a dataset of samples, we want to compute the predictions of our hypothesis using the theta that gives the lowest cost.

The objective was to return 1 if the prediction was greater than or equal to the threshold of 0.5, and 0 if it was below.

I managed to do this in Octave in one line of code as well, by computing the vectorized sigmoid hypothesis and comparing it to 0.5.

```matlab
p = sigmoid(X*theta) >= 0.5;   % logical vector: 1 = positive class, 0 = negative class
```

Microchip Classification

Visualizing the data

Visualization of the microchip data

Feature Mapping

Mapping the features to a sixth-degree polynomial

The data could not be separated by a straight line, which meant we had to create a more complex polynomial hypothesis that could fit the data based on the existing features. For this, we mapped the features onto all polynomial terms up to the sixth degree.

Feature Mapping
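
A sketch of how that mapping can be done (mapFeature is the helper name used in the exercise; treat the details here as illustrative):

```matlab
% Map two features to all polynomial terms up to degree 6:
% 1, x1, x2, x1^2, x1*x2, x2^2, ..., x1*x2^5, x2^6  (28 columns in total).
function out = mapFeature(X1, X2)
  degree = 6;
  out = ones(size(X1(:,1)));            % bias term
  for i = 1:degree
    for j = 0:i
      out(:, end+1) = (X1 .^ (i - j)) .* (X2 .^ j);
    end
  end
end
```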

Cost Function and Gradient

Cost Function with Regularization

The regularization term is added to the cost function. When computing it, we exclude the first parameter theta_0, since it should not be regularized.

Cost Function

Gradient with Regularization

To calculate the gradient with regularization, we compute the regularization term and add it to every component of the gradient except the one for theta_0, where we add nothing.

Gradient for theta 0

Gradient for theta j >= 1
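
A minimal sketch of both, reusing the pieces from above (theta(1) corresponds to theta_0 in Octave's 1-based indexing; X, y, theta, m and lambda are assumed to be defined):

```matlab
% Regularized cost: theta_0 is excluded from the penalty term.
h = sigmoid(X * theta);
J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
    + (lambda / (2*m)) * sum(theta(2:end) .^ 2);

% Regularized gradient: add the penalty to every component except theta_0.
grad = (1/m) * X' * (h - y);
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
```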

Underfitting (high bias), Just Right and Overfitting (high variance)

Once regularization was in place, I experimented with lambda being very low (0), 1, and very high (100). As a result, we can see how the decision boundary behaves in each case and which lambda gives the "just right" fit. We got an accuracy of 83% on the dataset with lambda = 1.

High Bias (λ=0), Just Right (λ=1), High Variance (λ=100)

## Week 2 - Linear Regression with Multiple Variables & Octave/Matlab Tutorial

This week, we predicted the profit of a food truck company based on data about the profit each food truck makes in different cities and the corresponding city populations.

Profit of food trucks by city population

The mandatory exercises covered gradient descent with one feature, and the optional ones use multiple features.

I first solved gradient descent with one feature using loops: iterating over the sum of the prediction errors, then over the number of features, and finally over the number of iterations that gradient descent runs.

Gradient Descent algorithm

I also solved it with the vectorized/matrix implementation, which is much quicker.

Gradient Descent algorithm
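
A sketch of that vectorized update inside the gradient descent loop (alpha is the learning rate and num_iters the number of iterations):

```matlab
% Vectorized batch gradient descent (sketch); X already includes the bias column.
for iter = 1:num_iters
  theta = theta - (alpha / m) * (X' * (X * theta - y));
end
```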

Before that, I had to calculate the cost function, which I also did with the vectorized method.

Cost Function
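
The vectorized cost computation is just as short (a sketch of the idea):

```matlab
% Mean squared error cost for linear regression, vectorized.
errors = X * theta - y;
J = (1 / (2 * m)) * (errors' * errors);
```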

Result

After 1500 iterations, gradient descent found the values of theta that converge to the minimum. The corresponding hypothesis plotted over the data looks like this.

Gradient Descent's Matrix implementation

With these results, we were able to predict the profit of a food truck for a city with a different population.
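
For example, with the learned theta, a prediction for a city of 35,000 people looks roughly like this (in this dataset populations are expressed in units of 10,000 and profits in units of $10,000):

```matlab
% Predict profit for a city of 35,000 people (feature in units of 10,000).
predict1 = [1, 3.5] * theta;                       % hypothesis: h(x) = theta' * x
fprintf('Predicted profit: $%.2f\n', predict1 * 10000);
```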

Hypothesis and data

Visualizations

The next graphs show the surface and contour plots, which let us visualize the values of theta that minimize the cost and therefore produce the most accurate hypothesis.

Surface and contour plots of the cost function

Multiple features linear regression

The exercise is to predict the selling price of a house given two features: its size and the number of bedrooms it has.

Feature normalization

The first step was to normalize the features using mean normalization. This guarantees that all features end up roughly in the range -1 <= x_i <= 1 and that each column of the normalized matrix has mean 0 and standard deviation 1. To do this, I worked out the matrix dimensions instead of using loops to calculate the normalized matrix.

The normalization formula was:

Mean normalization of feature x_i
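
A sketch of that normalization (mu and sigma are kept so the same transformation can later be applied to new examples):

```matlab
% Normalize every feature column to mean 0 and standard deviation 1.
mu = mean(X);                      % row vector of column means
sigma = std(X);                    % row vector of column standard deviations
X_norm = (X - mu) ./ sigma;        % implicit broadcasting (Octave / MATLAB R2016b+)
```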

Cost Calculation and Gradient Descent

The vectorized/matrix implementations done previously for the cost calculation and the gradient descent also apply to the multiple variables, since we are using the same hypothesis.

Learning rate

To get insight into the best learning rate for the algorithm, I plotted multiple figures, each with a learning rate about 3 times larger than the previous one. The best learning rate found was about 1, as the algorithm started to diverge at around 1.5.

Different alpha rates tested and the best learning rate found
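
A sketch of how such a comparison can be plotted, assuming a gradientDescentMulti function that returns the cost history of each run:

```matlab
% Compare convergence for several learning rates; X already has the bias column.
alphas = [0.01 0.03 0.1 0.3 1];
num_iters = 50;
figure; hold on;
for a = alphas
  [~, J_history] = gradientDescentMulti(X, y, zeros(size(X, 2), 1), a, num_iters);
  plot(1:num_iters, J_history);
end
xlabel('Number of iterations');
ylabel('Cost J');
legend(arrayfun(@(a) sprintf('alpha = %g', a), alphas, 'UniformOutput', false));
hold off;
```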

Normal Equation

With the normal equation, we can find theta analytically, without having to choose alpha or iterate as in gradient descent. This works well when the number of features is small, but becomes slow when n is very large, since it requires inverting an n x n matrix. The normal equation is:

Normal Equation
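
Reconstructed, together with its direct Octave translation (pinv is used instead of inv so the computation still works if X^T X happens to be singular):

```latex
\theta = (X^{T}X)^{-1}X^{T}y
```

```matlab
% Closed-form solution for linear regression; no learning rate, no iterations.
theta = pinv(X' * X) * X' * y;
```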