
CS6923 Machine Learning Spring 2020 Final Project -- ML algos and extensions


CS 6923 Machine Learning Final Project Report

Group members:

Ruinan Zhang rz1109@nyu.edu

Xianbo Gao xg656@nyu.edu

1 Introduction and Report Structure:

1.1 Introduction

In this project, we explored three ML algorithms we learned during this course. For each basic algorithm, we came up with one extension that can improve its overall performance. We first implemented the basic and extended algorithms using existing libraries (Scikit-learn, TensorFlow, PyTorch, ...). We then built our own implementation of each extension from scratch using only Numpy.

The three algorithms we explored and their corresponding extensions are:

  1. Support Vector Machine (SVM) with soft margin

  2. Linear Regression vs. Lasso Regression with coordinate descent

  3. Neural Network with convolutional layers

2 Support vector machine (SVM) with Soft Margin

2.1 Explanation about SVM with soft margin algorithm

In Homework 9, we implemented the hard-margin SVM and solved both the primal and dual problems. In this project, we focus on solving the soft-margin dual problem, since most real-life datasets are not linearly separable.

Recall that in the hard-margin SVM case, the constrained optimization problem (primal) is:
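In standard notation, with $m$ training examples $x^{(i)}$ and labels $y^{(i)} \in \{-1, +1\}$:

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^2 \quad \text{s.t.} \quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1, \qquad i = 1,\dots,m
$$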

After we transform the original problem into its dual formulation by adding Lagrange multipliers, we have the dual optimization problem:
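In the same notation, with Lagrange multipliers $\alpha_i$:

$$
\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j\,y^{(i)}y^{(j)}\,x^{(i)\top}x^{(j)}
\quad \text{s.t.} \quad \alpha_i \ge 0,\ \ \sum_{i=1}^{m}\alpha_i y^{(i)} = 0
$$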

The hard-margin SVM has its limits, as it assumes the data is linearly separable. But what if we have a few misclassified data points, or points that fall within the margin?

Then we extend to the soft margin SVM case -- which introduces a new tolerance variable (or slack variable) 𝜁(𝑖) for each training example 𝑥(𝑖).


Points that lie on the margin will have 𝜁(𝑖) = 0; the farther a point is from the correct side of the boundary, the larger the value of 𝜁(𝑖) will be.

Thus, our optimization problem becomes:
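With slack variables $\zeta^{(i)}$ and penalty parameter $C$ (standard form, notation as above):

$$
\min_{w,\,b,\,\zeta}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{m}\zeta^{(i)}
\quad \text{s.t.} \quad y^{(i)}\big(w^\top x^{(i)} + b\big) \ge 1 - \zeta^{(i)},\ \ \zeta^{(i)} \ge 0
$$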

Notice that C is a tunable parameter that gives the relative importance of the error term: it controls how heavily the model is punished for each misclassified point. The lower the C parameter, the softer the margin.

After adding the Lagrange multipliers and applying the kernel trick, we obtain the soft-margin SVM dual quadratic programming problem:
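With a kernel function $K(\cdot,\cdot)$ (e.g. the linear kernel $K(x, x') = x^\top x'$), the standard form is the same as the hard-margin dual except that each $\alpha_i$ is now box-constrained by $C$:

$$
\max_{\alpha}\ \sum_{i=1}^{m}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j\,y^{(i)}y^{(j)}\,K\big(x^{(i)}, x^{(j)}\big)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{m}\alpha_i y^{(i)} = 0
$$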

Then we use CVXOPT's quadratic programming solver, cvxopt.solvers.qp, to solve this problem. This is very similar to the homework's hard-margin dual, except that we have an additional upper-bound constraint on 𝛼.

When using CVXOPT's quadratic solver, we implement this constraint by stacking an m×m identity matrix below the matrix G and appending m copies of the value C to the vector h.
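As a rough sketch of that setup (assuming K is the precomputed m×m Gram/kernel matrix and y the vector of ±1 labels; variable names are ours, not necessarily those in the notebook):

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_soft_margin_dual(K, y, C):
    """Solve the soft-margin SVM dual with cvxopt.solvers.qp."""
    m = K.shape[0]
    P = matrix((np.outer(y, y) * K).astype(float))   # P_ij = y_i y_j K(x_i, x_j)
    q = matrix(-np.ones(m))                          # maximize sum(alpha) == minimize -sum(alpha)
    # Box constraints 0 <= alpha_i <= C, written as G @ alpha <= h
    G = matrix(np.vstack([-np.eye(m),                # -alpha_i <= 0
                          np.eye(m)]))               #  alpha_i <= C
    h = matrix(np.hstack([np.zeros(m),
                          C * np.ones(m)]))
    A = matrix(y.reshape(1, -1).astype(float))       # equality: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solution = solvers.qp(P, q, G, h, A, b)
    return np.ravel(solution['x'])                   # optimal alphas
```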

2.2 Summary of soft-margin SVM implementation using Sklearn and Numpy

  • Datasets:

For SVM, we chose to run this algorithm on two datasets: (1) the Iris dataset used in HW3, and (2) the Breast Cancer Wisconsin (Diagnostic) dataset.

  • Hard Margin and Soft Margin SVMs using Sklearn

Sklearn's SVM uses C as the regularization parameter that controls how heavily the model is punished for each misclassified point; the lower the C parameter, the softer the margin.

Since we used a linear kernel for SVM in class, we also chose the linear kernel for the Iris dataset. However, to explore the soft margin's effect, we also used an RBF kernel on the other dataset (Breast Cancer).

For the hard-margin SVM with the linear kernel we set the parameter C to 1000, and for the soft-margin SVM we set C to 1.5, for both datasets mentioned above.
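A minimal sketch of this comparison with Sklearn on the Iris data (the 5-fold split and variable names are illustrative assumptions):

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Very large C approximates a hard margin (little slack allowed);
# a small C gives a softer margin.
hard_svm = SVC(kernel='linear', C=1000)
soft_svm = SVC(kernel='linear', C=1.5)

print(cross_val_score(hard_svm, X, y, cv=5).mean())
print(cross_val_score(soft_svm, X, y, cv=5).mean())
```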

  • Soft Margin SVM built using Numpy

We also used C = 1.5 and the linear kernel when we implemented our own soft-margin SVM extension (see the Numpy code in Section 2.5).

2.3 Accuracies Table for SVM

Note: since both the Iris and Breast Cancer datasets are relatively small, we used Sklearn's cross-validation functions to calculate the accuracies:

2.4 Explanations on specific datasets for SVM

  • Iris Dataset

Using Sklearn, we can see some improvement on each cross-validation fold:

In our own Numpy implementation, we analyzed how accuracy might improve on Iris using different param C values:

  • Breast Cancer Dataset

Using Sklearn, we can see some improvement on each cross-validation fold:

In our own Numpy implementation, we analyzed how accuracy might improve on the Breast Cancer dataset using different param C values:

From the chart above, unlike on the Iris dataset, we can see that the accuracy first goes up and then goes down as we increase the C value. One possible explanation is that at the beginning the penalty is very small, so the model underfits; as C becomes very large, the margin hardens again and the model can start to overfit.

2.5 Numpy Code for soft margin SVM
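The full implementation is in the project notebook. As a hedged sketch of the remaining steps for the linear kernel, once the dual has been solved (for example with the solve_soft_margin_dual helper sketched in Section 2.1), the weight vector and bias can be recovered and used for prediction roughly as follows (the tolerance value is an assumption):

```python
import numpy as np

def recover_linear_svm(X, y, alphas, C, tol=1e-5):
    """Recover (w, b) of a linear soft-margin SVM from the dual solution."""
    w = ((alphas * y)[:, None] * X).sum(axis=0)
    # b is averaged over margin support vectors (0 < alpha_i < C)
    margin_sv = (alphas > tol) & (alphas < C - tol)
    b = np.mean(y[margin_sv] - X[margin_sv] @ w)
    return w, b

def predict(X_new, w, b):
    return np.sign(X_new @ w + b)
```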

3 Lasso Regression with Coordinate Descent

3.1 Explanation about Lasso with coordinate descent

In class, we implemented Ridge Regression with L2 regularization, and the professor also covered Lasso regression with L1 regularization in lecture. Unlike Ridge Regression, which has a closed-form solution, Lasso Regression has no closed-form solution for minimizing:
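In standard notation (with regularization strength $\alpha$; the exact scaling convention may differ from the class slides):

$$
E_{\text{lasso}}(w) \;=\; \frac{1}{2}\sum_{i=1}^{n}\big(y^{(i)} - w^\top x^{(i)}\big)^2 \;+\; \alpha\sum_{j=1}^{d}\lvert w_j\rvert
$$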

Although 𝐸_lasso(𝑤) is convex, it is not differentiable everywhere, so we cannot simply take the derivative and set it to zero to find the optimal 𝑤. However, it is possible to optimize the 𝑗-th coefficient while keeping the others fixed:
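One common way to write that coordinate-wise solution (notation may differ slightly from the class slides) uses the soft-thresholding operator $S(\rho,\alpha) = \operatorname{sign}(\rho)\max(\lvert\rho\rvert - \alpha,\ 0)$:

$$
w_j \leftarrow \frac{S(\rho_j,\ \alpha)}{z_j},
\qquad
\rho_j = \sum_{i=1}^{n} x_j^{(i)}\Big(y^{(i)} - \sum_{k\neq j} w_k\, x_k^{(i)}\Big),
\qquad
z_j = \sum_{i=1}^{n}\big(x_j^{(i)}\big)^2
$$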

One way to solve optimization problems is coordinate descent, in which at each step we optimize over one component of the unknown parameter vector, fixing all other components. The descent path so obtained is a sequence of steps, each of which is parallel to a coordinate axis in R^d, hence the name. It turns out that for the Lasso optimization problem, we can find a closed form solution for optimization over a single component fixing all other components.

In the Numpy code, we implement the Coordinate Descent algorithm (taken from the class slides "Lec 04 Regularization", page 23):
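A minimal Numpy sketch of that algorithm under the scaling convention written above (the notebook version may differ in details such as the stopping criterion):

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Soft-thresholding operator S(rho, alpha)."""
    return np.sign(rho) * np.maximum(np.abs(rho) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, max_iter=1000):
    """Lasso via cyclic coordinate descent on (1/2)||y - Xw||^2 + alpha*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    z = (X ** 2).sum(axis=0)                     # z_j = sum_i x_ij^2
    for _ in range(max_iter):
        for j in range(d):
            # residual with the j-th coordinate's contribution removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j
            w[j] = soft_threshold(rho_j, alpha) / z[j]
    return w
```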

3.2 Summary of Lasso Regression’s Implementation

  • Dataset

In this extension, we decided to use the Boston Housing dataset (used in HW4) and the NYSE stock market prices dataset (from outside class) [Data source: https://www.kaggle.com/dgawlik/nyse/kernels].

  • Sklearn implementation of Linear, Ridge and Lasso regressions

Since we implemented ridge regression in Homework 4, we think it's interesting to see how plain linear regression, ridge regression, and lasso regression work on the same dataset.

Recall that in class we did Ridge regression with k-fold validation and found the best alpha for the housing price dataset to be alpha = 1.49, so we use this alpha with Sklearn's Ridge Regression here.

And for Lasso, we used alpha = 0.03 in Sklearn implementation.
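A hedged sketch of that comparison, assuming the Boston data is loaded through sklearn.datasets.load_boston (available in the Scikit-learn versions of that time) and an illustrative train/test split:

```python
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare plain linear regression, ridge (alpha=1.49), and lasso (alpha=0.03)
for name, model in [('Linear', LinearRegression()),
                    ('Ridge', Ridge(alpha=1.49)),
                    ('Lasso', Lasso(alpha=0.03))]:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          mean_squared_error(y_test, pred),
          mean_absolute_error(y_test, pred),
          r2_score(y_test, pred))
```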

  • Numpy Implementation

We set alpha to 0.05 and the maximum number of coordinate descent iterations to 1,000.

3.3 Accuracies Table for Lasso

Note: for the regression problems, we used Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) scores to measure accuracy:

One reason the MSE of Lasso on the GOOGLE stock price is very high is that the y values (the stock prices themselves) are very large, and the data contains some outliers. Overall, since the R2 scores are quite high, we can see that our models fit the datasets well.

3.4 Explanations on Specific Datasets Lasso

Let's see how well our predictions of y compare to the true y values for each dataset:

  • Boston housing:

We can see that all three regressions from Sklearn have quite similar performance. One reason might be that there are not many outliers in the Boston housing dataset, so Sklearn's linear regression neither overfits nor underfits; it is already very good.

  • NYSE stock prices:

As we can see here and in the accuracies table above, Lasso and Ridge perform better than Linear regression on the NYSE stock market data, in both the Sklearn implementation and our own Numpy implementation. This might be because the original dataset has some severe outliers, so the Linear model might overfit.

3.5 Numpy code for Lasso

3.6 Further explanation on Lasso regression

Although on most datasets Lasso regression and Ridge regression achieve similar MSEs and R2 scores in terms of regularization, there is one capability that only Lasso regression offers and that can be very useful: feature selection.

This is because along the Lasso regularization path, coefficients can shrink exactly to 0, which never happens in Ridge regression.

For example, the chart below shows the Lasso path on the Boston data:
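A plot like that can be generated with Sklearn's lasso_path; a minimal sketch (again assuming load_boston and the default alpha grid):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import lasso_path

X, y = load_boston(return_X_y=True)
alphas, coefs, _ = lasso_path(X, y)

# One line per feature: coefficients hit exactly zero as alpha grows.
for coef in coefs:
    plt.plot(np.log10(alphas), coef)
plt.xlabel('log10(alpha)')
plt.ylabel('coefficient value')
plt.title('Lasso path on the Boston housing data')
plt.show()
```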

4 Neural Networks with Convolutional layers (CNN)

4.1 Explanation about CNN Algorithm

The key operation performed in CNN layers is 2D convolution. In Homework 8, we implemented neural networks with fully connected layers.

Difference between fully connected layers and convolutional layers:

In a fully connected layer, each neuron receives input from every element of the previous layer. In a convolutional layer, neurons receive input from only a restricted subarea of the previous layer.

In CNNs, each member of the kernel is used at every feasible position of the input. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set.
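For intuition, here is a naive Numpy sketch of a single-channel 2D convolution with stride 1 and no padding (illustrative only, not the notebook's implementation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D convolution (really cross-correlation, as in most CNN libraries)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the same kernel weights are reused at every spatial position
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out
```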

4.2 Summary of CNN implementation

  1. Two convolution layers, each with several filters (a PyTorch sketch of this architecture follows the list)

  2. Weights initialized randomly

  3. ReLU as the activation function

  4. Max pooling applied when passing through the layers

  5. A flatten layer at the end, followed by softmax

  6. Gradient descent used for training

  7. Forward and backward propagation used to update the parameters
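A hedged PyTorch sketch of an architecture matching the list above; the layer sizes, learning rate, and optimizer settings are illustrative assumptions rather than the exact values used in our notebooks:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3),   # first conv layer with several filters
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3),  # second conv layer
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 5 * 5, num_classes)

    def forward(self, x):
        x = self.features(x)                  # (N, 16, 5, 5) for 28x28 MNIST input
        x = torch.flatten(x, 1)
        return self.classifier(x)             # softmax is applied inside CrossEntropyLoss

model = SimpleCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # plain gradient descent
criterion = nn.CrossEntropyLoss()
```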

4.3 Accuracies Table

4.4 Specific explanation on datasets

The dataset we used is the MNIST handwritten digit database (http://yann.lecun.com/exdb/mnist/), which contains images of handwritten digits from 0 to 9.

Since it takes a long time to train, there is only one final result. The accuracy of the CNN varies a lot with different parameters; for example, the accuracy of the Numpy implementation would drop to 83.6% on the homework dataset. There may be some differences between Pytorch's gradient descent routine and our Numpy implementation, so the two did not achieve the same performance with the same hyperparameters. We tried to find the best hyperparameters for each, so we chose different learning rates (alpha) as well as different layer parameters. The reason the Numpy implementation performs better than Pytorch is perhaps due to implementation differences, and because the Pytorch model was not given optimal hyperparameters.

The CNN performs better than the plain NN because it uses max pooling to keep the most significant value within a moving window, and it can extract shapes from the image by convolution when there is an obvious shape. This is useful for identifying features in the image.

4.5 Numpy Code

See the Python notebook under the CNN folder.