# Linear regression gradient calculation

This repository provides the calculus and linear algebra behind the linear regression gradient calculation. Both vectorized and non-vectorized forms are shown.

This post takes you through the gradient calculation for the linear regression cost function. A deeper dive with more context is available in the accompanying blog post.

## Linear regression equation

Below is the linear regression equation, where by convention $x_0 = 1$ so that $\theta_0$ acts as the intercept:

$$ y = \sum_{j=0}^{n} \theta_j x_j $$

Below is the linear regression cost function:

$$\begin{aligned} & J = \frac{1}{2m} \sum_{i=1}^{m} (\sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)})^2 \end{aligned}$$
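
As a sanity check, here is a minimal NumPy sketch of this cost function, written with explicit loops to mirror the formula. The names `cost`, `X`, `y`, and `theta` are illustrative, and the first column of `X` is assumed to be all ones so that $x_0 = 1$:

```python
import numpy as np

def cost(theta, X, y):
    """J = 1/(2m) * sum_i (sum_j theta_j * x_j^(i) - y^(i))^2, loops mirroring the formula."""
    m, n_plus_1 = X.shape
    total = 0.0
    for i in range(m):
        prediction = sum(theta[j] * X[i, j] for j in range(n_plus_1))  # sum_j theta_j x_j^(i)
        total += (prediction - y[i]) ** 2
    return total / (2 * m)

# Tiny illustrative dataset: bias column x_0 = 1 plus one feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(cost(np.array([1.0, 2.0]), X, y))  # y = 1 + 2x fits exactly, so J = 0.0
```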

## Non-vectorized derivative calculation

In this section we compute the derivatives without vectorization. First, the differentiation rules we may need are listed below:

$$\begin{aligned} \frac{d}{dx}(f(x)+g(x)) & = \frac{d}{dx}f(x) + \frac{d}{dx}g(x) \quad\quad && \text{(1.1) Sum rule }\\\ f(x) & =ax^b \Rightarrow \frac{d}{dx}f(x) = bax^{b-1} \quad\quad && \text{(1.2) Power rule }\\\ h(x) & = f(x)g(x) \Rightarrow \frac{d}{dx}h(x) = f(x)\frac{d}{dx}g(x) + \frac{d}{dx}f(x)g(x) \quad\quad && \text{(1.3) Product rule}\\\ h(x) & = g(f(x)) \Rightarrow \frac{d}{dx}h(x) = \frac{d}{dx}g(f(x)) \frac{d}{dx}f(x) \quad\quad && \text{(1.4) Chain rule} \\\ \frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) & = \frac{g(x)\frac{d}{dx}f(x) - f(x)\frac{d}{dx}g(x)}{(g(x))^2} \quad\quad && \text{(1.5) Quotient rule} \end{aligned}$$

We proceed with the step-by-step differentiation of the cost function. Since the cost function is multivariable, the derivatives become partial derivatives, one for each parameter $\theta_k$.

$$\begin{aligned} \frac{\partial}{\partial \theta_k }J & = \frac{\partial}{\partial \theta_k} \left( \frac{1}{2m} \sum_{i=1}^{m} \left( \sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)} \right)^2 \right) \quad\quad &&\text{apply (1.4), (1.2)}\\\ & = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( \sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)} \right) \frac{\partial}{\partial \theta_k} \left( \sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)} \right) \quad\quad &&\text{apply (1.1), (1.2)} \\\ & = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)} \right) x_k^{(i)} \end{aligned}$$
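
Translated directly into code, this result is a double loop over parameters and examples. A minimal sketch with the same illustrative names as above:

```python
import numpy as np

def gradient_loops(theta, X, y):
    """dJ/dtheta_k = 1/m * sum_i (sum_j theta_j x_j^(i) - y^(i)) * x_k^(i)."""
    m, n_plus_1 = X.shape
    grad = np.zeros(n_plus_1)
    for k in range(n_plus_1):                  # one partial derivative per parameter
        for i in range(m):                     # sum over the dataset
            prediction = sum(theta[j] * X[i, j] for j in range(n_plus_1))
            grad[k] += (prediction - y[i]) * X[i, k]
    return grad / m
```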

## Vectorizing the derivation result

Here we vectorize the result of the derivation. The inner sum can be rewritten as a dot product, i.e. a trivial matrix multiplication.

$$\begin{aligned} & \frac{\partial}{\partial \theta_k }J = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)} \right) x_k^{(i)} \\\ & \frac{\partial}{\partial \theta_k }J = \left( \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^T \textbf{x}^{(i)} - y^{(i)} \right) x_k^{(i)} \right) \end{aligned}$$

Calculating all the partial derivatives at once can be expressed as calculating the vector of partial derivatives, i.e. the gradient.

$$\begin{aligned} \nabla_{\boldsymbol{\theta}} J & = \left( \frac{\partial}{\partial \theta_0} J, \frac{\partial}{\partial \theta_1} J, \dots, \frac{\partial}{\partial \theta_n} J \right) \\\ & = \left( \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^T \textbf{x}^{(i)} - y^{(i)} \right) x_0^{(i)} , \dots, \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^T \textbf{x}^{(i)} - y^{(i)} \right) x_n^{(i)} \right) \\\ & = \frac{1}{m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^T \textbf{x}^{(i)} - y^{(i)} \right) \textbf{x}^{(i)} \\\ \end{aligned}$$
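
The last line above already suggests a half-vectorized implementation: a single pass over the examples, accumulating one scalar-vector product per example. A sketch, again with illustrative names:

```python
import numpy as np

def gradient_per_example(theta, X, y):
    """grad = 1/m * sum_i (theta^T x^(i) - y^(i)) * x^(i)."""
    m = X.shape[0]
    grad = np.zeros_like(theta)
    for i in range(m):
        x_i = X[i]                      # the vector x^(i)
        residual = theta @ x_i - y[i]   # scalar: theta^T x^(i) - y^(i)
        grad += residual * x_i          # scalar-vector product
    return grad / m
```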

The last sum to vectorize is the summation over the dataset. To vectorize it, we first introduce a matrix $\textbf{X}$ in which each row is $(\textbf{x}^{(i)})^T$. Likewise, we introduce $\textbf{y}$ as the vector of outputs. The predictions then take the form $\textbf{X}\boldsymbol{\theta}$, and the vector of residuals is $\textbf{X}\boldsymbol{\theta} - \textbf{y}$. What remains is to take care of the $\textbf{x}^{(i)}$ in the previous equation. Remember that the gradient must be a vector.

Let us first understand the role of $\textbf{x}^{(i)}$ in the previous equation. The parenthesized expression evaluates to a single number, while $\textbf{x}^{(i)}$ is a vector, so the summation adds up scalar-vector products. Returning to our new notation, we have a vector of residuals: each entry of that vector needs to multiply the corresponding vector $\textbf{x}^{(i)}$, and the results need to be summed. Multiplying the residual vector on the left by $\textbf{X}^T$ does exactly that.

$$ \nabla_{\boldsymbol{\theta}} J = \frac{1}{m} \textbf{X}^T(\textbf{X}\boldsymbol{\theta} - \textbf{y}) $$
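
In NumPy the final form is a one-liner. The sketch below also checks it against the per-example accumulation from the previous section, on random illustrative data:

```python
import numpy as np

def gradient_vectorized(theta, X, y):
    """grad = 1/m * X^T (X theta - y)."""
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # bias column + 2 features
y = rng.normal(size=5)
theta = rng.normal(size=3)

# Per-example accumulation, as derived in the previous section.
grad_loop = sum((theta @ x_i - y_i) * x_i for x_i, y_i in zip(X, y)) / len(y)
print(np.allclose(gradient_vectorized(theta, X, y), grad_loop))  # -> True
```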

## Vectorized derivative calculation

Vectorization is usually done before any differentiation rules are applied. In this section we vectorize the cost function immediately and apply matrix calculus rules to the vectorized form.

$$\begin{aligned} J & = \frac{1}{2m} \sum_{i=1}^{m} \left( \sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)} \right)^2 \\\ & = \frac{1}{2m} \sum_{i=1}^{m} \left( \boldsymbol{\theta}^T \textbf{x}^{(i)} - y^{(i)} \right)^2 \\\ & = \frac{1}{2m} ( \textbf{X}\boldsymbol{\theta} - \textbf{y} )^2 \end{aligned}$$

We will focus on the term without the $\frac{1}{2m}$ normalization factor and account for it at the end. The square of the residual vector is shorthand for the following dot product:

$$\begin{aligned} & ( \textbf{X}\boldsymbol{\theta} - \textbf{y} )^2 = ( \textbf{X}\boldsymbol{\theta} - \textbf{y} )^T ( \textbf{X}\boldsymbol{\theta} - \textbf{y} ) \end{aligned}$$
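
Numerically, this identity just says the sum of squared residuals equals the dot product of the residual vector with itself. A quick check on arbitrary illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
X, y, theta = rng.normal(size=(6, 3)), rng.normal(size=6), rng.normal(size=3)

r = X @ theta - y                         # residual vector X theta - y
print(np.isclose(np.sum(r ** 2), r @ r))  # -> True
```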

Before calculating the gradient, we will list the calculus / linear algebra rules that will be useful.

$$\begin{aligned} \textbf{a}^T \textbf{b} & = \textbf{b}^T \textbf{a} \quad\quad && \text{(2.1)}\\\ (\textbf{A} \textbf{B})^T & = \textbf{B}^T \textbf{A}^T \quad\quad && \text{(2.2)}\\\ (\textbf{A} + \textbf{B})^T & = \textbf{A}^T + \textbf{B}^T \quad\quad && \text{(2.3)}\\\ \nabla_{x} \textbf{b}^T \textbf{x} & = \textbf{b} \quad\quad && \text{(2.4)}\\\ \nabla_{x} \textbf{x}^T \textbf{A} \textbf{x} & = 2 \textbf{A} \textbf{x} \quad\quad && \text{(2.5), for symmetric } \textbf{A}\\\ \nabla_{\boldsymbol{x}} \textbf{W}\textbf{x} & = \textbf{W} \quad\quad && \text{(2.6)}\\\ \nabla_{\boldsymbol{x}} \textbf{x}^T\textbf{W} & = \textbf{W}^T \quad\quad && \text{(2.7)}\\\ a^T & =a \quad\quad && \text{(2.8), for scalar } a \end{aligned}$$
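
Rules (2.4) and (2.5) are easy to spot-check with central finite differences. A small sketch (the helper `numerical_gradient` is ours, and `A` is symmetrized because (2.5) requires it):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central differences: grad_k ~ (f(x + eps*e_k) - f(x - eps*e_k)) / (2*eps)."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        grad[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
b, M, x = rng.normal(size=4), rng.normal(size=(4, 4)), rng.normal(size=4)
A = (M + M.T) / 2  # symmetric, as (2.5) requires

print(np.allclose(numerical_gradient(lambda v: b @ v, x), b))                         # (2.4)
print(np.allclose(numerical_gradient(lambda v: v @ A @ v, x), 2 * A @ x, atol=1e-5))  # (2.5)
```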

Gradient calculation follows.

$$\begin{aligned} \nabla_{\boldsymbol{\theta}} ( \textbf{X}\boldsymbol{\theta} - \textbf{y} )^T ( \textbf{X}\boldsymbol{\theta} - \textbf{y} ) & = \nabla_{\boldsymbol{\theta}} ( \boldsymbol{\theta}^T \textbf{X}^T - \textbf{y}^T ) ( \textbf{X}\boldsymbol{\theta} - \textbf{y} ) \quad\quad &&\text{apply (2.3), (2.2)}\\\ & = \nabla_{\boldsymbol{\theta}} ( \boldsymbol{\theta}^T\textbf{X}^T\textbf{X}\boldsymbol{\theta} - \textbf{y}^T\textbf{X}\boldsymbol{\theta} - \boldsymbol{\theta}^T\textbf{X}^T\textbf{y} + \textbf{y}^T\textbf{y} ) \end{aligned}$$

The term $\boldsymbol{\theta}^T\textbf{X}^T\textbf{y}$ is actually a scalar: it is the product of a $1 \times (n+1)$ row vector, an $(n+1) \times m$ matrix, and an $m \times 1$ column vector. The transpose of a scalar is that same scalar, so transposing the term changes nothing, but it lets us simplify the equation.

$$\begin{aligned} & = \nabla_{\boldsymbol{\theta}} ( \boldsymbol{\theta}^T\textbf{X}^T\textbf{X}\boldsymbol{\theta} - \textbf{y}^T\textbf{X}\boldsymbol{\theta} - \boldsymbol{\theta}^T\textbf{X}^T\textbf{y} + \textbf{y}^T\textbf{y} ) &&\text{apply (2.8)}\\\ & = \nabla_{\boldsymbol{\theta}} ( \boldsymbol{\theta}^T\textbf{X}^T\textbf{X}\boldsymbol{\theta} - \textbf{y}^T\textbf{X}\boldsymbol{\theta} - ( \boldsymbol{\theta}^T\textbf{X}^T\textbf{y} )^T + \textbf{y}^T\textbf{y} ) &&\text{apply (2.2)}\\\ & = \nabla_{\boldsymbol{\theta}} ( \boldsymbol{\theta}^T\textbf{X}^T\textbf{X}\boldsymbol{\theta} - \textbf{y}^T\textbf{X}\boldsymbol{\theta} - \textbf{y}^T\textbf{X}\boldsymbol{\theta} + \textbf{y}^T\textbf{y} ) \\\ & = \nabla_{\boldsymbol{\theta}} ( \boldsymbol{\theta}^T\textbf{X}^T\textbf{X}\boldsymbol{\theta} - 2\textbf{y}^T\textbf{X}\boldsymbol{\theta} + \textbf{y}^T\textbf{y} ) \\\ & = \nabla_{\boldsymbol{\theta}} ( \boldsymbol{\theta}^T\textbf{X}^T\textbf{X}\boldsymbol{\theta} ) - 2\nabla_{\boldsymbol{\theta}} (\textbf{y}^T\textbf{X}\boldsymbol{\theta} ) + \nabla_{\boldsymbol{\theta}} (\textbf{y}^T\textbf{y} ) \\\ \end{aligned}$$
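
The simplified expansion is easy to verify numerically before taking gradients, on arbitrary illustrative data:

```python
import numpy as np

rng = np.random.default_rng(3)
X, y, theta = rng.normal(size=(6, 3)), rng.normal(size=6), rng.normal(size=3)

lhs = (X @ theta - y) @ (X @ theta - y)
rhs = theta @ X.T @ X @ theta - 2 * (y @ X @ theta) + y @ y
print(np.isclose(lhs, rhs))  # -> True
```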

We take a look at each gradient individually:

$$\begin{aligned} & \nabla_{\boldsymbol{\theta}} (\boldsymbol{\theta}^T\textbf{X}^T\textbf{X}\boldsymbol{\theta} ) = 2 \textbf{X}^T\textbf{X}\boldsymbol{\theta} &&\text{apply (2.5); } \textbf{X}^T\textbf{X} \text{ is symmetric}\\\ & \nabla_{\boldsymbol{\theta}} (\textbf{y}^T\textbf{X}\boldsymbol{\theta} ) = \nabla_{\boldsymbol{\theta}} ((\textbf{X}^T\textbf{y})^T\boldsymbol{\theta}) = \textbf{X}^T\textbf{y} &&\text{apply (2.2), (2.4)}\\\ & \nabla_{\boldsymbol{\theta}} (\textbf{y}^T\textbf{y} ) = 0 \end{aligned}$$

We obtain:

$$\begin{aligned} & = 2 \textbf{X}^T\textbf{X}\boldsymbol{\theta} - 2 \textbf{X}^T\textbf{y} \\\ & = 2 \textbf{X}^T(\textbf{X}\boldsymbol{\theta} - \textbf{y}) \end{aligned}$$

Accounting for the normalization factor, we arrive at the same gradient as in the non-vectorized derivation:

$$ \nabla_{\boldsymbol{\theta}} J = \frac{1}{2m} 2 \textbf{X}^T(\textbf{X}\boldsymbol{\theta} - \textbf{y}) = \frac{1}{m} \textbf{X}^T(\textbf{X}\boldsymbol{\theta} - \textbf{y}) $$
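
As a final check, the closed-form gradient can be compared against a finite-difference approximation of the cost itself. A self-contained sketch (all names illustrative):

```python
import numpy as np

def cost(theta, X, y):
    r = X @ theta - y
    return (r @ r) / (2 * len(y))          # J = 1/(2m) ||X theta - y||^2

def gradient(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)  # 1/m * X^T (X theta - y)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y = rng.normal(size=8)
theta = rng.normal(size=3)

# Central finite differences on J itself, one unit vector e per parameter.
eps = 1e-6
numeric = np.array([(cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(gradient(theta, X, y), numeric))  # -> True
```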