Imagine that you are the producer for a comedy show at your school. We need you to use your knowledge of linear regression to predict how successful the show will be.
The comedy show is trying to figure out how much money to spend on advertising in the student newspaper. The newspaper tells the show that
- For every two dollars spent on advertising, three students attend the show.
- If no money is spent on advertising, no one will attend the show.
Write a linear regression function called attendance that shows the relationship between advertising and attendance expressed by the newspaper.
def attendance(advertising):
    pass
attendance(100) # 150
attendance(50) # 75
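The newspaper's two claims are enough to pin down a line: three students per two dollars gives the slope, and zero attendance at zero spend gives the y-intercept. As a worked sketch, writing x for advertising dollars and y for attendance (symbols chosen here for illustration):

```latex
m = \frac{3}{2} = 1.5, \qquad b = 0, \qquad y = 1.5x
```

Checking against the example calls above: 1.5 times 100 is 150, and 1.5 times 50 is 75.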
As the old adage goes, "Don't ask the barber if you need a haircut!" Likewise, despite what the student newspaper says, the comedy show knows from experience that they'll still have a crowd even without an advertising budget. Some of the comedians in the show have friends (believe it or not), and twenty of those friends will show up. Write a function called attendance_with_friends that models the following:
- When the advertising budget is zero, 20 friends still attend
- Three additional people attend the show for every two dollars spent on advertising
def attendance_with_friends(advertising):
    pass
attendance_with_friends(100) # 170
attendance_with_friends(50) # 95
Let's plot this line so you can get a sense of what your function looks like on a graph.
Our x values can be a list of initial_sample_budgets, equal to a list of our budgets. And we can use the outputs of our attendance_with_friends function to determine the list of attendance_values, the attendance at each of those x values.
initial_sample_budgets = [0, 50, 100]
attendance_values = [20, 95, 170]
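The hard-coded attendance values above can also be computed from the budgets. Here is a minimal sketch, assuming a stand-in function named example_attendance (a hypothetical name, not part of the lab) that encodes the 20-friend baseline and the rate of three people per two dollars:

```python
# Hypothetical stand-in for the lab's attendance_with_friends function:
# 20 friends attend with no advertising, plus 1.5 people per dollar spent.
def example_attendance(advertising):
    return 1.5 * advertising + 20

sample_budgets = [0, 50, 100]

# Evaluate the line once per budget to get the matching y values.
computed_attendance = [example_attendance(budget) for budget in sample_budgets]
print(computed_attendance)  # [20.0, 95.0, 170.0]
```

The values come out as floats because the slope 1.5 is a float. They compare equal to the integers 20, 95, and 170.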
First we import the plotly library and its graph_objs module, and set up plotly to be used without uploading our plots to its website.
Finally, we plot our regression line using our attendance_with_friends function. Our x values will be the budgets. For our y values, we use our attendance_with_friends function to create a list of attendances, one for every x input.
import plotly
from plotly import graph_objs
plotly.offline.init_notebook_mode(connected=True)
trace_of_attendance_with_friends = graph_objs.Scatter(
    x=initial_sample_budgets,
    y=attendance_values,
)
plotly.offline.iplot([trace_of_attendance_with_friends])
trace_of_attendance_with_friends
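Now let's write a couple of functions that we can use going forward. We'll write a function called m_b_data that, given the slope of a line m, a y-intercept b, and a list of x_values, returns a dictionary with a key of x pointing to the list of x_values and a key of y pointing to a list of y_values. Each y value is the output of the line with the given m and b at the corresponding x value.
def m_b_data(m, b, x_values):
    pass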
m_b_data(1.5, 20, [0, 50, 100]) # {'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}
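One possible shape for this kind of function is sketched below. The name line_points is hypothetical and chosen so it does not shadow the exercise. It assumes each y value is simply m times x plus b.

```python
# Hypothetical helper (not the lab's m_b_data): evaluates y = m*x + b
# for each x and packages both lists in a dictionary.
def line_points(m, b, x_values):
    y_values = [m * x + b for x in x_values]
    return {'x': x_values, 'y': y_values}

print(line_points(1.5, 20, [0, 50, 100]))
# {'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}
```

The x list is passed through unchanged, so the two lists always have the same length and line up index by index.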
Now let's write a function called m_b_trace that uses our m_b_data function to return a dictionary that includes keys of name and mode in addition to x and y. The values of mode and name are provided as arguments. When the mode argument is not provided, it has a default value of lines, and when name is not provided, it has a default value of line function.
def m_b_trace(m, b, x_values, mode = 'lines', name = 'line function'):
    pass
m_b_trace(1.5, 20, [0, 50, 100])
# {'mode': 'lines', 'name': 'line function', 'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}
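The default-argument behavior described above can be illustrated with a small sketch. The helper name labelled_points and its data parameter are assumptions made for this example and are not part of the lab.

```python
# Hypothetical helper illustrating default keyword arguments; the name
# labelled_points and its data argument are assumptions for this sketch.
def labelled_points(data, mode='lines', name='line function'):
    # Copy the x/y dictionary so the caller's data is left unchanged.
    labelled = dict(data)
    labelled.update({'mode': mode, 'name': name})
    return labelled

points = {'x': [0, 50, 100], 'y': [20.0, 95.0, 170.0]}
default_trace = labelled_points(points)
custom_trace = labelled_points(points, mode='markers', name='sample data')
print(default_trace['mode'], default_trace['name'])  # lines line function
print(custom_trace['mode'], custom_trace['name'])    # markers sample data
```

Calling the helper with no keyword arguments falls back to the defaults, while passing mode or name overrides only that key.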
The comedy show decides to advertise for two different shows. The attendance looks like the following.
Budgets (dollars) | Attendance |
---|---|
200 | 400 |
400 | 700 |
In code, we represent this as the following:
first_show = {'budget': 200, 'attendance': 400}
second_show = {'budget': 400, 'attendance': 700}
Write a function called marginal_return_on_budget that returns the expected increase in attendance for every additional dollar spent on advertising.
The function should use the formula for calculating the slope of a line given two points.
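For reference, the slope between two points (x1, y1) and (x2, y2) is the change in y divided by the change in x. Worked through with the two shows above, treating budget as x and attendance as y:

```latex
m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{700 - 400}{400 - 200} = \frac{300}{200} = 1.5
```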
def marginal_return_on_budget(first_show, second_show):
    pass
marginal_return_on_budget(first_show, second_show) # 1.5
first_show
Just to check, let's use some different data to make sure our marginal_return_on_budget function calculates the slope properly.
imaginary_third_show = {'budget': 300, 'attendance': 500}
imaginary_fourth_show = {'budget': 600, 'attendance': 900}
marginal_return_on_budget(imaginary_third_show, imaginary_fourth_show) # 1.3333333333333333
Great! Now we'll begin to write functions that we can use going forward. The functions will calculate attributes of lines in general and can be used to predict the attendance of the comedy show.
Take the following data. The comedy show spends zero dollars on advertising for the next show. The attendance chart now looks like the following:
Budgets (dollars) | Attendance |
---|---|
0 | 100 |
200 | 400 |
400 | 700 |
budgets = [0, 200, 400]
attendance_numbers = [100, 400, 700]
To get you started, we'll provide a function called sorted_points that accepts a list of x values and a list of y values and returns a list of point coordinates sorted by their x values. The return value is a list of sorted tuples.
def sorted_points(x_values, y_values):
    # Pair each x with its y, then sort the pairs by x.
    values = list(zip(x_values, y_values))
    sorted_values = sorted(values, key=lambda value: value[0])
    return sorted_values
sorted_points([4, 1, 6], [4, 6, 7]) # [(1, 6), (4, 4), (6, 7)]
In this section, we'll write a function called build_starting_line. The function that we end up building simply draws a line between our points with the highest and lowest x values. We are selecting these points as an arbitrary "starting" point for our regression line.
As John von Neumann said, "truth … is much too complicated to allow anything but approximations." All models are inherently wrong, but some are useful. In future lessons, we will learn how to build a regression line that accurately matches our dataset. For now, we will focus on building a useful "starting" line using the first and last points along the x-axis.
First, write a function called slope that, given a list of x values and a list of y values, will use the points with the lowest and highest x values to calculate the slope of a line.
def slope(x_values, y_values):
    pass
slope([200, 400], [400, 700]) # 1.5
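The idea of using the two extreme points can be sketched as follows. The name endpoint_slope is hypothetical, and the sketch assumes at least two points with distinct x values (otherwise the division fails).

```python
# Hypothetical helper (not the lab's slope function): slope of the line
# through the points with the smallest and largest x values.
def endpoint_slope(x_values, y_values):
    points = sorted(zip(x_values, y_values), key=lambda point: point[0])
    (x_low, y_low), (x_high, y_high) = points[0], points[-1]
    return (y_high - y_low) / (x_high - x_low)

print(endpoint_slope([200, 400], [400, 700]))         # 1.5
print(endpoint_slope([400, 0, 200], [700, 10, 400]))  # 1.725
```

Sorting first means the inputs do not need to arrive in x order. Any points between the two extremes are ignored.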
Now write a function called y_intercept. Use the slope function to calculate the slope if it isn't provided as an argument. Then use the slope and the coordinates of the point with the highest x value to return the y-intercept.
def y_intercept(x_values, y_values, m = None):
    pass
y_intercept([200, 400], [400, 700]) # 100
y_intercept([0, 200, 400], [10, 400, 700]) # 10
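The y-intercept follows from rearranging y = mx + b. As a worked check of the first example above, with m = 1.5 and the highest-x point (400, 700):

```latex
b = y - m x = 700 - 1.5 \cdot 400 = 700 - 600 = 100
```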
Now write a function called build_starting_line that, given a list of x_values and a list of y_values, returns a dictionary with a key of m and a key of b holding the m and b values of the calculated regression line. Use the slope and y_intercept functions to calculate the line.
def build_starting_line(x_values, y_values):
    pass
build_starting_line([0, 200, 400], [10, 400, 700]) # {'b': 10.0, 'm': 1.725}
Finally, let's write a function called expected_value_for_line that returns the expected attendance given the slope m, the y-intercept b, and an x value.
first_show = {'budget': 300, 'attendance': 700}
second_show = {'budget': 400, 'attendance': 900}
shows = [first_show, second_show]
def expected_value_for_line(m, b, x_value):
    pass
expected_value_for_line(1.5, 100, 100) # 250
Now that we have built these functions, we can use them on our dataset. Run the lines below to see how we can use our functions going forward.
first_show = {'budget': 200, 'attendance': 400}
second_show = {'budget': 400, 'attendance': 700}
third_show = {'budget': 300, 'attendance': 500}
fourth_show = {'budget': 600, 'attendance': 900}
comedy_shows = [first_show, second_show, third_show, fourth_show]
show_x_values = list(map(lambda show: show['budget'], comedy_shows))
show_y_values = list(map(lambda show: show['attendance'], comedy_shows))
def trace_values(x_values, y_values, mode = 'markers', name="data"):
    return {'x': x_values, 'y': y_values, 'mode': mode, 'name': name}
def plot(traces):
    plotly.offline.iplot(traces)
comedy_show_trace = trace_values(show_x_values, show_y_values, name = 'comedy show data')
comedy_show_trace
show_starting_line = build_starting_line(show_x_values, show_y_values)
show_starting_line
trace_show_line = m_b_trace(show_starting_line['m'], show_starting_line['b'], show_x_values, name = 'starting line')
trace_show_line
plot([comedy_show_trace, trace_show_line])
As we can see above, we built a "starting" regression line out of the points with the lowest and highest x values. We will learn in future lessons how to improve our line so that it becomes the "best fit" given all of our dataset, not just the first and last points. For now, this approach sufficed since our goal was to practice working with and plotting line functions.