In the last lesson, we derived the functions that help us descend along our cost functions efficiently. Remember that this technique is not so different from what we saw with using the derivative to tell us our next step size and direction in two dimensions.
When descending along our cost curve in two dimensions, we used the slope of the tangent line at each point, to tell us how large of a step to take next. And with the cost curve being a function of
But really it's an analogous approach. Just like we can use the derivative of a function
You will be able to:
- Create a full gradient descent algorithm
- Apply a gradient descent algorithm on a data set with more than one variable
Luckily for us, we already did the hard work of deriving these formulas. Now we get to see the fruit of our labor. The following formulas tell us how to update the regression variables $m$ and $b$:
- $ \frac{\partial}{\partial m}J(m,b) = -2\sum_{i = 1}^n x_i(y_i - (mx_i + b)) = -2\sum_{i = 1}^n x_i\epsilon_i $
- $ \frac{\partial}{\partial b}J(m,b) = -2\sum_{i = 1}^n(y_i - (mx_i + b)) = -2\sum_{i = 1}^n \epsilon_i $
Now the formulas above tell us to take some dataset, with values of
- `current_m` = `old_m` $ - (-2\sum_{i=1}^n x_i\epsilon_i) $
- `current_b` = `old_b` $ - (-2\sum_{i=1}^n \epsilon_i) $
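As a rough illustration of the two update formulas, here is a minimal sketch. The helper name `gradients` and the three-point dataset are assumptions made up for this example, not part of the lesson's code.

```python
import numpy as np

def gradients(x, y, m, b):
    """Return (dJ/dm, dJ/db) for J(m, b) = sum((y - (m*x + b))**2).

    Hypothetical helper, named here only for illustration.
    """
    errors = y - (m * x + b)            # epsilon_i for every point
    dJ_dm = -2 * np.sum(x * errors)     # -2 * sum(x_i * epsilon_i)
    dJ_db = -2 * np.sum(errors)         # -2 * sum(epsilon_i)
    return dJ_dm, dJ_db

# Made-up data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])

old_m, old_b = 0.0, 0.0
dJ_dm, dJ_db = gradients(x, y, old_m, old_b)

# Raw update: subtract each partial derivative from the old value
current_m = old_m - dJ_dm
current_b = old_b - dJ_db
print(current_m, current_b)
```

Both derivatives are negative at the starting point, so subtracting them increases $m$ and $b$. The raw step is far too large here, which is the kind of overshoot a learning rate is meant to tame.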
OK, let's turn this into code. First, let's initialize our data like we did before:
import numpy as np
np.set_printoptions(formatter={'float_kind':'{:f}'.format})
import matplotlib.pyplot as plt
np.random.seed(225)
x = np.random.rand(30, 1).reshape(30)
y_randterm = np.random.normal(0,3,30)
y = 3 + 50* x + y_randterm
data = np.array([y, x])
data = np.transpose(data)
plt.plot(x, y, '.b')
plt.xlabel("x", fontsize=14)
plt.ylabel("y", fontsize=14);
Now:
- Let's set our initial regression line by initializing the $m$ and $b$ variables as zero. Store them in `b_current` and `m_current`.
- Let's next initialize the updates to these variables by setting `update_to_b` and `update_to_m` equal to 0.
- Define an `error_at` function which returns the error $\epsilon_i$ for a given $i$. The parameters are:
  - `point`: a row of the particular data set
  - `b`: the intercept term
  - `m`: the slope
- Then, use this `error_at` function to iterate through each of the points in the dataset, and at each iteration add $-2\epsilon$ to `update_to_b` and $-2x\epsilon$ to `update_to_m`, matching the terms inside the sums of the partial derivatives above.
# initial variables of our regression line
#amount to update our variables for our next step
# Define the error_at function
# iterate through data to change update_to_b and update_to_m
# Create new_b and new_m by subtracting the updates from the current estimates
In the last two lines of the code above, we calculate our `new_b` and `new_m` values by taking our current values and subtracting our respective updates. We define a function called `error_at`, which we can use in the error component of our partial derivatives above.
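The single step described above could be sketched as follows. This is only one possible way to fill in the cell. The three-row `data` array is made up for illustration (rows follow the lesson's `[y, x]` layout), and the accumulated terms carry the $-2$ factor from the partial derivatives, so subtracting them moves the line toward lower cost.

```python
import numpy as np

# Made-up rows in the lesson's layout: column 0 is y, column 1 is x
data = np.array([[1.0, 0.0],
                 [3.0, 1.0],
                 [5.0, 2.0]])

# initial variables of our regression line
b_current = 0
m_current = 0

# amount to update our variables for our next step
update_to_b = 0
update_to_m = 0

def error_at(point, b, m):
    """Error epsilon_i = y_i - (m * x_i + b) for one row [y, x]."""
    return point[0] - (m * point[1] + b)

# accumulate the summands of the two partial derivatives
for point in data:
    error = error_at(point, b_current, m_current)
    update_to_b += -2 * error
    update_to_m += -2 * point[1] * error

# subtract the accumulated derivatives from the current estimates
new_b = b_current - update_to_b
new_m = m_current - update_to_m
print(new_b, new_m)
```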
The code above represents just one update to our regression line, and therefore just one step towards our best fit line. We'll just repeat the process to take multiple steps. But first, we have to make a couple of other changes.
OK, the above code is very close to what we want, but we need to make two tweaks before it's perfect.
The first one is obvious if we think about what these formulas are really telling us to do. Look at the graph below, and think about what it means to change each of our
Multiplying our step size by our learning rate works fine, so long as we multiply both of the partial derivatives by the same amount. This is because we think of our gradient, $\nabla J(m,b)$, as steering us in the correct direction. In other words, our derivatives ensure we are making the correct proportional changes to $m$ and $b$.
For our second tweak, note that in general the larger the dataset, the larger the sum of our errors will be. But that doesn't mean our formulas are less accurate and therefore deserve larger changes. It just means that the total error is larger. We should really think of accuracy as being proportional to the size of our dataset. We can correct for this effect by dividing our update by the size of our dataset, $n$.
Make these changes below:
#amount to update our variables for our next step
# define learning rate and n
# create update_to_b and update_to_m
# create new_b and new_m
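One way these two tweaks might look in code is sketched below. The dataset is again a made-up three-row example in `[y, x]` layout, and the learning rate of 0.1 is an assumed value chosen for illustration.

```python
import numpy as np

# Made-up rows: column 0 is y, column 1 is x (points on y = 2x + 1)
data = np.array([[1.0, 0.0],
                 [3.0, 1.0],
                 [5.0, 2.0]])

def error_at(point, b, m):
    """Error epsilon_i = y_i - (m * x_i + b) for one row [y, x]."""
    return point[0] - (m * point[1] + b)

b_current = 0.0
m_current = 0.0

# amount to update our variables for our next step
update_to_b = 0.0
update_to_m = 0.0

# define learning rate and n (0.1 is an assumed value)
learning_rate = 0.1
n = len(data)

# create update_to_b and update_to_m, averaged over the dataset
for point in data:
    error = error_at(point, b_current, m_current)
    update_to_b += -(2 / n) * error
    update_to_m += -(2 / n) * point[1] * error

# create new_b and new_m, scaling both updates by the same learning rate
new_b = b_current - learning_rate * update_to_b
new_m = m_current - learning_rate * update_to_m
print(new_b, new_m)
```

Compared with the raw update, the step is now a small move in the same direction, because both partial derivatives are scaled by the same factor `learning_rate / n`.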
So our code now reflects what we know about our gradient descent process. Start with an initial regression line with values of
As mentioned earlier, the code above represents just one update to our regression line, and therefore just one step towards our best fit line. To take multiple steps, we wrap the process we want to duplicate in a function called `step_gradient`, which we can then call as many times as we want. With this function:
- Include a learning_rate of 0.1
- Return a tuple of (b,m)
The parameters should be:
- `b_current`: the starting value of b
- `m_current`: the starting value of m
- `points`: the data points (rows of the dataset) over which we evaluate the gradient

See if you can use your `error_at` function within the `step_gradient` function!
def step_gradient(b_current, m_current, points):
    pass
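A minimal sketch of what `step_gradient` might look like is given below, assuming rows of `points` are laid out as `[y, x]`. It keeps the factor of 2 from the partial derivatives. Dropping that constant is equivalent to halving the learning rate, so per-step values produced by this sketch need not match outputs printed elsewhere in this lesson, even though both variants head toward the same minimum.

```python
import numpy as np

def error_at(point, b, m):
    """Error epsilon_i = y_i - (m * x_i + b) for one row [y, x]."""
    return point[0] - (m * point[1] + b)

def step_gradient(b_current, m_current, points):
    """Take one gradient descent step and return the tuple (b, m)."""
    learning_rate = 0.1
    n = len(points)
    b_gradient = 0.0
    m_gradient = 0.0
    for point in points:
        error = error_at(point, b_current, m_current)
        b_gradient += -(2 / n) * error
        m_gradient += -(2 / n) * point[1] * error
    new_b = b_current - learning_rate * b_gradient
    new_m = m_current - learning_rate * m_gradient
    return (new_b, new_m)

# Made-up points on the line y = 2x + 1, in [y, x] layout
points = np.array([[1.0, 0.0],
                   [3.0, 1.0],
                   [5.0, 2.0]])

b, m = step_gradient(0, 0, points)
print(b, m)
```

Calling the function with the true parameters, `step_gradient(1.0, 2.0, points)`, returns them unchanged, since every error is zero there.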
Now let's initialize `b` and `m` as 0 and run a first iteration of the `step_gradient` function.
# b= 3.02503, m= 2.07286
3.0250308395837813
2.0728619246505193
So just looking at input and output, we begin by setting
# b = 5.63489, m= 3.902265
(5.634896312558807, 3.902265648903966)
Let's do this, say, 1000 times.
# create a for loop to do this
Let's take a look at the estimates in the last iteration.
#
(3.1619764855577257, 49.84313430300858)
As you can see, our m and b values both update with each step. Not only that, but with each step, the size of the changes to m and b decreases. This is because they are approaching a best fit line.
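The repeated stepping described above could be sketched as a simple loop. The `step_gradient` below is an assumed implementation (it keeps the factor of 2 from the derivatives), and the three points are made up so that the best fit line is exactly $y = 2x + 1$.

```python
import numpy as np

def step_gradient(b_current, m_current, points):
    """Assumed implementation: one averaged gradient step, rows are [y, x]."""
    learning_rate = 0.1
    n = len(points)
    b_gradient = 0.0
    m_gradient = 0.0
    for point in points:
        error = point[0] - (m_current * point[1] + b_current)
        b_gradient += -(2 / n) * error
        m_gradient += -(2 / n) * point[1] * error
    return (b_current - learning_rate * b_gradient,
            m_current - learning_rate * m_gradient)

# Made-up points on the line y = 2x + 1
points = np.array([[1.0, 0.0],
                   [3.0, 1.0],
                   [5.0, 2.0]])

# Repeat the step 1000 times, keeping every estimate along the way
b, m = 0.0, 0.0
history = [(b, m)]
for _ in range(1000):
    b, m = step_gradient(b, m, points)
    history.append((b, m))

# Compare the size of the first and last change in b
first_change = abs(history[1][0] - history[0][0])
last_change = abs(history[-1][0] - history[-2][0])
print(b, m, first_change, last_change)
```

On this toy data the estimates settle close to $b = 1$ and $m = 2$, and the last change is far smaller than the first, which mirrors the shrinking steps noted above.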
Below, we generated a problem where we have 2 predictors. We generated data such that the best fit line is around $\hat{y} = 2 + 3x_1 - 4x_2$. Write a `step_gradient_multi` function that can take an arbitrary number of predictors (so the function should be able to include more than 2 predictors as well). Good luck!
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(11)
x1 = np.random.rand(100,1).reshape(100)
x2 = np.random.rand(100,1).reshape(100)
y_randterm = np.random.normal(0,0.2,100)
y = 2+ 3* x1+ -4*x2 + y_randterm
data = np.array([y, x1, x2])
data = np.transpose(data)
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
ax1.set_title('x_1')
ax1.plot(x1, y, '.b')
ax2.set_title('x_2')
ax2.plot(x2, y, '.b');
Note that, for our gradients, when having multiple predictors
So we'll have one gradient per predictor along with the gradient for the intercept!
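As a sketch of what these gradients look like, assume $k$ predictors and a prediction $\hat{y}_i = b + \sum_{j=1}^k m_j x_{j,i}$ for point $i$, with error $\epsilon_i = y_i - \hat{y}_i$ (the index $j$ and the symbol $k$ are notation introduced here for illustration). The single-predictor formulas then generalize to:

- $ \frac{\partial}{\partial m_j}J(m_1,\ldots,m_k,b) = -2\sum_{i = 1}^n x_{j,i}\left(y_i - \left(\sum_{l=1}^k m_l x_{l,i} + b\right)\right) = -2\sum_{i = 1}^n x_{j,i}\epsilon_i $ for each $j = 1,\ldots,k$
- $ \frac{\partial}{\partial b}J(m_1,\ldots,m_k,b) = -2\sum_{i = 1}^n \left(y_i - \left(\sum_{l=1}^k m_l x_{l,i} + b\right)\right) = -2\sum_{i = 1}^n \epsilon_i $

With $k = 1$ these reduce to the two formulas used for a single slope $m$.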
Create the `step_gradient_multi` function below. As we said before, this means that we have more than one feature that we are using as an independent variable in the regression. This function will have the same inputs as `step_gradient`, but it will be able to handle having more than one value for m. It should return the final values for b and m in the form of a tuple.

You might have to refactor your `error_at` function if you want to use it with multiple m values.
def step_gradient_multi(b_current, m_current, points):
    pass
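One possible shape for `step_gradient_multi` is sketched below. It assumes each row of `points` is `[y, x_1, ..., x_k]`, treats `m_current` as an array of $k$ slopes, and replaces the per-point loop with NumPy matrix products. The learning rate of 0.1 and the four noise-free rows are assumptions for illustration, not the lesson's data.

```python
import numpy as np

def step_gradient_multi(b_current, m_current, points):
    """One gradient step for any number of predictors; returns (b, m)."""
    learning_rate = 0.1                      # assumed, as in step_gradient
    points = np.asarray(points, dtype=float)
    m_current = np.asarray(m_current, dtype=float)
    y = points[:, 0]                         # first column holds y
    X = points[:, 1:]                        # remaining columns hold predictors
    n = len(points)

    errors = y - (X @ m_current + b_current) # epsilon_i for every row
    b_gradient = -(2 / n) * np.sum(errors)
    m_gradient = -(2 / n) * (X.T @ errors)   # one entry per predictor

    new_b = b_current - learning_rate * b_gradient
    new_m = m_current - learning_rate * m_gradient
    return (new_b, new_m)

# Made-up, noise-free rows generated from y = 2 + 3*x1 - 4*x2
points = np.array([[ 2.0, 0.0, 0.0],
                   [ 5.0, 1.0, 0.0],
                   [-2.0, 0.0, 1.0],
                   [ 1.0, 1.0, 1.0]])

# Apply 1 step
b_one, m_one = step_gradient_multi(0.0, np.zeros(2), points)

# Apply 500 steps
b, m = 0.0, np.zeros(2)
for _ in range(500):
    b, m = step_gradient_multi(b, m, points)
print(b_one, m_one, b, m)
```

Because the slopes live in one array, the same function works unchanged for three or more predictors, provided `m_current` has one entry per predictor column.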
Apply 1 step to our data
Apply 500 steps to our data
Look at the last step
(1.944428332442866, array([2.995890, -3.911055]))
Try your own gradient descent algorithm on the Boston Housing data set, and compare with the result from scikit-learn! Start by testing on a few continuous variables, and see how you perform. Scikit-learn has built-in "regularization" parameters to make optimization more feasible for many parameters.
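In this section, we saw our gradient descent formulas in action. The core of the gradient descent functions is understanding the two lines:
- $ \frac{\partial}{\partial m}J(m,b) = -2\sum_{i = 1}^n x_i(y_i - (mx_i + b)) = -2\sum_{i = 1}^n x_i\epsilon_i $
- $ \frac{\partial}{\partial b}J(m,b) = -2\sum_{i = 1}^n(y_i - (mx_i + b)) = -2\sum_{i = 1}^n \epsilon_i $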
Which both look to the errors of the current regression line for our dataset to determine how to update the regression line next. These formulas came from our cost function,