- Understand how regression lines can help us make predictions about data
- Understand how the components of slope and y-intercept determine the output of a regression line
- Understand how to represent a line as a function
Now that we know a little bit about plotting data, let's see if we can make sense of some data. We'll start by trying to use data to predict how much money a movie will make. On predicting the box office success of a movie, screenwriter William Goldman famously said, "nobody knows anything." Well, let's try to know something.
Imagine we are hired as a consultant for a movie executive. The movie executive receives a budget proposal, and wants to see how much money the movie might make. We can help him by trying to see the relationship between money spent on a movie, and money made.
Here are five movies:
movies = [{'title': 'American Hustle', 'budget': 40000000, 'revenue': 148430908}, {'title': 'Captain Phillips', 'budget': 55000000, 'revenue': 107136417}, {'title': 'Frozen', 'budget': 150000000, 'revenue': 393050114}, {'title': 'Gravity', 'budget': 110000000, 'revenue': 271814796}, {'title': 'Despicable Me 2', 'budget': 76000000, 'revenue': 368065385}]
Remember that when we want to plot data, we translate the values so that `budget` is the x value and `revenue` is the y value. Let's just plot a few movies to get started.
Using our `trace_values` function, we can plot these points, so long as we pass in the lists of `x_values`, `y_values`, and `text_values`. You can see the functions that we built previously in our `graph.py` file.
x_values = list(map(lambda movie: movie['budget'], movies))
y_values = list(map(lambda movie: movie['revenue'], movies))
text_values = list(map(lambda movie: movie['title'], movies))
import plotly
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)
from graph import trace_values, plot, layout
movie_trace = trace_values(x_values, y_values, text_values = text_values, name = 'movie revenue')
movie_layout = layout(options = {'title': 'Movie Spending and Revenue'})
plot([movie_trace], movie_layout)
This plot shows us that as the movie budget increases, the movie revenue tends to increase. For example, look at the point furthest left, at a budget of 40 million. That point represents the movie "American Hustle", with 40 million dollars spent and 148 million dollars earned domestically. Gravity, in the center of our plot, spent over twice as much and earned almost twice as much.
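As a quick check of that comparison, here is a minimal sketch that computes how Gravity's budget and revenue compare to American Hustle's. The figures are copied from the `movies` list; the variable names are ours, chosen for illustration.

```python
# Budget and revenue figures copied from the movies list.
american_hustle = {'title': 'American Hustle', 'budget': 40000000, 'revenue': 148430908}
gravity = {'title': 'Gravity', 'budget': 110000000, 'revenue': 271814796}

# How many times larger are Gravity's numbers than American Hustle's?
budget_ratio = gravity['budget'] / american_hustle['budget']
revenue_ratio = gravity['revenue'] / american_hustle['revenue']

print(budget_ratio)             # 2.75
print(round(revenue_ratio, 2))  # 1.83
```

So Gravity spent 2.75 times as much and earned about 1.83 times as much.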
So, at least we now know something.
Ok, now imagine our movie executive tells us that a movie came across his desk with a budget of $55 million. Based on the data we graphed, how much money do you think the movie would bring in?
To predict movie revenue based on a budget, let's draw a single straight line that approximates the relationship between a movie's budget and revenue using our previous data as a benchmark.
Later, we'll worry about how well a line like the one below describes our data. For now, let's use this.
regression_trace = trace_values([0, 150000000], [0, 450000000], mode = 'lines', name = 'estimated revenue')
plot([movie_trace, regression_trace], movie_layout)
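One of the benefits of using a line is that we can see how much money will be brought in for any point on this line. All we need to do is look at a given $x$ value (the budget) and find the corresponding $y$ value (the revenue) on the line.
Instead of just representing this line visually, we would also like to represent this line with a function. This way, instead of us needing to read the $y$ value off the graph for each $x$ value, the function can calculate it for us.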
Let's take an initial (wrong) guess as to how to turn this line into a function. First we represent the line as a mathematical formula:

$$y = x$$
And then we turn this formula into a function:
def y(x):
    return x
y(0)
0
y(10000000)
10000000
This is pretty nice. We just wrote a function that automatically calculates the expected revenue given a movie budget. This function says that for every value of $x$ that we pass in, we get back the same value for $y$.
But take a look at the line that we drew. Our line says something different. The line says that spending 30 million brings predicted earnings of 90 million.
So we need to change our function so that it lines up with our line. In fact, we need a consistent way to turn lines into functions, and vice versa. Ok, let's get to it.
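The mismatch can be checked numerically. The sketch below compares our first guess with the value read off the drawn line, assuming we read the line by interpolating between the two endpoints passed to `trace_values`. The helper `value_on_line` is a hypothetical name used only for this illustration.

```python
# Endpoints of the drawn line, taken from the regression_trace call.
x_start, y_start = 0, 0
x_end, y_end = 150000000, 450000000

def first_guess(x):
    """The initial (wrong) guess: y = x."""
    return x

def value_on_line(x):
    """Read the drawn line's y value at x by interpolating between its endpoints."""
    return y_start + (y_end - y_start) * (x - x_start) / (x_end - x_start)

budget = 30000000
print(first_guess(budget))    # 30000000
print(value_on_line(budget))  # 90000000.0
```

The first guess returns 30 million, while the line itself says 90 million, so the function does not yet describe the line.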
We can start by taking a look at our chart below, which shows how our line relates x-values and y-values -- that is budget, and revenue.
X (budget) | Y (revenue) |
---|---|
0 | 0 |
30 million | 90 million |
60 million | 180 million |
Ok, so now we need an equation that will allow us to input 0 and get back 0, input 30 million and get back 90 million, and input 60 million and get back 180 million. What equation is that?
Well, it's $y = 3x$:
- 0 = 3 * 0
- 90 million = 3 * 30 million
- 180 million = 3 * 60 million
Let's see what this formula looks like in code. In the next section, we'll show how we figured this out.
def y(x):
    return 3*x
y(30000000)
90000000
y(0)
0
Progress! We multiplied each value of $x$ by a number, 3, so that the output of our function matches the line.
That value of 3 is called the slope. It's generally used in describing a line, and you will usually see it represented as $m$, like so:

$$y = mx$$
Let's make sure we understand what all of these variables stand for. Here they are:
- $y$: the value that is returned, also called the response variable, as it responds to values of $x$
- $x$: the input variable, also called the explanatory variable, as it explains the value of $y$
- $m$: the slope variable, which determines how vertical or horizontal the line will be
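To get a feel for how $m$ changes a line, here is a small sketch. The function name `line_through_origin` and the sample slopes 1 and 5 are hypothetical, chosen for illustration; only the slope of 3 comes from our movie line.

```python
def line_through_origin(x, m):
    """Return y = m * x for a line with slope m that passes through (0, 0)."""
    return m * x

budget = 30000000
for m in [1, 3, 5]:
    # A larger slope means a steeper line: more y for the same x.
    print(m, line_through_origin(budget, m))
# 1 30000000
# 3 90000000
# 5 150000000
```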
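In our movie example, these terms make sense. The revenue, $y$, responds to the budget, and the budget, $x$, helps explain the revenue.
The variable $m$ is the slope of our line. For the line we drew it is 3, so each additional dollar of budget adds three dollars of predicted revenue.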
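Ok, there is just one more thing we need to learn before we can describe every straight line in a two-dimensional world: the y-intercept.
The y-intercept is the $y$ value of the line where it crosses the y-axis, that is, the value of $y$ when $x = 0$. For example, the second line plotted below has the same slope as our original line but a higher y-intercept.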
regression_trace_increased = trace_values([0, 150000000], [50000000, 500000000], mode = 'lines', name = 'increased est. revenue')
plot([movie_trace, regression_trace, regression_trace_increased], movie_layout)
So looking at the graph, what is the y-intercept of the original estimated revenue line? Well, it's the value of $y$ when $x = 0$, which is 0. For the increased estimated revenue line, the y-intercept is 50 million.
In addition to determining the y-intercept from a line on a graph, you can also see the y-intercept by looking at a table of points. In the table below, we can see that 50 million is the y-intercept of the new line. After all, it's the value of $y$ when $x = 0$.
X | Y |
---|---|
0 | 50 million |
40 million | 170 million |
60 million | 230 million |
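Reading the y-intercept off a table can also be sketched in code. The helper `y_intercept_from_points` below is a hypothetical name, and it assumes the table includes a row where $x$ is 0.

```python
# The rows of the table, as (x, y) pairs.
points = [(0, 50000000), (40000000, 170000000), (60000000, 230000000)]

def y_intercept_from_points(points):
    """Return the y value of the point where x is 0, or None if no such point exists."""
    for x, y in points:
        if x == 0:
            return y
    return None

print(y_intercept_from_points(points))  # 50000000
```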
Great, so now we have all of the information we need to describe any straight line:

$$y = mx + b$$

In this formula, $m$ is the slope of the line and $b$ is the y-intercept.
In the context of our movies, the line with values of $m = 3$ and $b = 50$ million is our increased estimated revenue line, $y = 3x + 50{,}000{,}000$.
Now let's translate our formula into a function, so that for any input of $x$, our function returns the value of $y$.
def y(x):
    return 3*x + 50000000
y(30000000)
140000000
y(60000000)
230000000
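As a slightly more general sketch, we can pass the slope and y-intercept in as arguments, so one function covers both of the lines we drew. The name `regression_formula` is our own choice for illustration. Here we use it to answer the executive's question about a movie with a 55 million dollar budget.

```python
def regression_formula(x, m, b):
    """Return y = m * x + b for a line with slope m and y-intercept b."""
    return m * x + b

budget = 55000000

# Original line: slope 3, y-intercept 0.
print(regression_formula(budget, m=3, b=0))         # 165000000

# Increased line: slope 3, y-intercept 50 million.
print(regression_formula(budget, m=3, b=50000000))  # 215000000
```

Keeping `m` and `b` as parameters means the same function can describe any straight line, not only the two plotted here.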
In this section, we saw how we can estimate the relationship between an input variable and an output. We did so by plotting our points and then drawing a straight line right through them. We can see any output on a line for a given input simply by looking at the y-value of the line at that point of $x$.