Multiple Linear Regression - Cumulative Lab

Introduction

In this cumulative lab you'll perform an end-to-end analysis of a dataset using multiple linear regression.

Objectives

You will be able to:

  • Prepare data for regression analysis using pandas
  • Build multiple linear regression models using StatsModels
  • Measure regression model performance
  • Interpret multiple linear regression coefficients

Your Task: Develop a Model of Diamond Prices

(Image: tweezers holding a diamond; photo by Tahlia Doyle on Unsplash)

Business Understanding

You've been asked to perform an analysis to see how various factors impact the price of diamonds. There are various guides online that claim to tell consumers how to avoid getting "ripped off", but you've been asked to dig into the data to see whether these claims ring true.

Data Understanding

We have downloaded a diamonds dataset from Kaggle, which came with this description:

  • price: price in US dollars ($326–$18,823)
  • carat: weight of the diamond (0.2–5.01)
  • cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color: diamond colour, from J (worst) to D (best)
  • clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x: length in mm (0–10.74)
  • y: width in mm (0–58.9)
  • z: depth in mm (0–31.8)
  • depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  • table: width of top of diamond relative to widest point (43–95)

Requirements

1. Load the Data Using Pandas

Practice loading CSV data into a pandas dataframe once again.

2. Build a Baseline Simple Linear Regression Model

Identify the feature that is most correlated with price and build a StatsModels linear regression model using just that feature.

3. Evaluate and Interpret Baseline Model Results

Explain the overall performance as well as parameter coefficients for the baseline simple linear regression model.

4. Prepare a Categorical Feature for Multiple Regression Modeling

Identify a promising categorical feature and use pd.get_dummies() to prepare it for modeling.

5. Build a Multiple Linear Regression Model

Using the data from Step 4, create a second StatsModels linear regression model using one numeric feature and one one-hot encoded categorical feature.

6. Evaluate and Interpret Multiple Linear Regression Model Results

Explain the performance of the new model in comparison with the baseline, and interpret the new parameter coefficients.

1. Load the Data Using Pandas

Import pandas (with the standard alias pd), and load the data from the file diamonds.csv into a DataFrame called diamonds.

Be sure to specify index_col=0 to avoid creating an "Unnamed: 0" column.

# Your code here
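
If you get stuck, one possible approach (assuming diamonds.csv sits in the same directory as this notebook) is:

import pandas as pd

# index_col=0 uses the first CSV column as the index instead of
# letting pandas create an "Unnamed: 0" column
diamonds = pd.read_csv("diamonds.csv", index_col=0)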

The following code checks that you loaded the data correctly:

# Run this cell without changes

# diamonds should be a dataframe
assert type(diamonds) == pd.DataFrame

# Check that there are the correct number of rows
assert diamonds.shape[0] == 53940

# Check that there are the correct number of columns
# (if this crashes, make sure you specified `index_col=0`)
assert diamonds.shape[1] == 10

Inspect the distributions of the numeric features:

# Run this cell without changes
diamonds.describe()

And inspect the value counts for the categorical features:

# Run this cell without changes
categoricals = diamonds.select_dtypes("object")

for col in categoricals:
    print(diamonds[col].value_counts(), "\n")

2. Build a Baseline Simple Linear Regression Model

Identifying a Highly Correlated Predictor

The target variable is price. Look at the correlation coefficients for all of the numeric predictor variables to find the one with the highest correlation with price.

# Your code here - look at correlations
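
One possible sketch: compute each numeric feature's correlation with price and sort. Note that numeric_only=True is needed in recent pandas versions because the dataset also contains text columns.

# Correlation of every numeric feature with price, strongest first
diamonds.corr(numeric_only=True)["price"].drop("price").sort_values(ascending=False)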

Identify the name of the predictor column with the strongest correlation below.

# Replace None with appropriate code
most_correlated = None

The following code checks that you specified a column correctly:

# Run this cell without changes

# most_correlated should be a string
assert type(most_correlated) == str

# most_correlated should be one of the columns other than price
assert most_correlated in diamonds.drop("price", axis=1).columns

Plotting the Predictor vs. Price

We'll also create a scatter plot of that variable vs. price:

# Run this cell without changes

# Plot a sample of 1000 data points, most_correlated vs. price
diamonds.sample(1000, random_state=1).plot.scatter(x=most_correlated, y="price");

Setting Up Variables for Regression

Declare y and X_baseline variables, where y is a Series containing price data and X_baseline is a DataFrame containing the column with the strongest correlation.

# Replace None with appropriate code
y = None
X_baseline = None
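
A minimal sketch, reusing the most_correlated string identified above:

# Double brackets keep X_baseline as a one-column DataFrame
y = diamonds["price"]
X_baseline = diamonds[[most_correlated]]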

The following code checks that you created valid y and X_baseline variables:

# Run this code without changes

# y should be a series
assert type(y) == pd.Series

# y should contain about 54k rows
assert y.shape == (53940,)

# X_baseline should be a DataFrame
assert type(X_baseline) == pd.DataFrame

# X_baseline should contain the same number of rows as y
assert X_baseline.shape[0] == y.shape[0]

# X_baseline should have 1 column
assert X_baseline.shape[1] == 1

Creating and Fitting Simple Linear Regression

The following code uses your variables to build and fit a simple linear regression.

# Run this cell without changes
import statsmodels.api as sm

baseline_model = sm.OLS(y, sm.add_constant(X_baseline))
baseline_results = baseline_model.fit()

3. Evaluate and Interpret Baseline Model Results

Write any necessary code to evaluate the model performance overall and interpret its coefficients.

# Your code here
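
One reasonable starting point is the full results summary, plus the individual attributes it is built from:

# Overall fit statistics and the coefficient table
print(baseline_results.summary())

# Or pull out individual pieces
print(baseline_results.rsquared)
print(baseline_results.params)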

Then summarize your findings below:

# Your written answer here
Solution

carat was the attribute most strongly correlated with price, so our baseline model describes the relationship between carat and price.

Overall this model is statistically significant and explains about 85% of the variance in price. In a typical prediction, the model is off by about $1k.

  • The intercept is about -$2.3k. Taken literally, this says a zero-carat diamond would sell for -$2.3k; since the smallest diamonds in the dataset weigh 0.2 carats, this is an artifact of extrapolating the line to zero rather than a meaningful prediction.
  • The coefficient for carat is about $7.8k. This means that each additional carat is associated with a price increase of about $7.8k.

4. Prepare a Categorical Feature for Multiple Regression Modeling

Now let's go beyond our simple linear regression and add a categorical feature.

Identifying a Promising Predictor

Below we create bar graphs for the categories present in each categorical feature:

# Run this cell without changes
import matplotlib.pyplot as plt

categorical_features = diamonds.select_dtypes("object").columns
fig, axes = plt.subplots(ncols=len(categorical_features), figsize=(12, 5))

for index, feature in enumerate(categorical_features):
    # numeric_only=True is needed in recent pandas versions, since
    # the other categorical columns can't be averaged
    diamonds.groupby(feature).mean(numeric_only=True).plot.bar(
        y="price", ax=axes[index])

Identify the name of the categorical predictor column you want to use in your model below. The choice here is more open-ended than choosing the numeric predictor above -- choose something that will be interpretable in a final model, and where the different categories seem to have an impact on the price.

# Replace None with appropriate code
cat_col = None
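
This choice is open-ended, but as one example, cut is a defensible pick: it has only five categories and a natural quality ordering.

# One reasonable choice among the three categorical columns
cat_col = "cut"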

The following code checks that you specified a column correctly:

# Run this cell without changes

# cat_col should be a string
assert type(cat_col) == str

# cat_col should be one of the categorical columns
assert cat_col in diamonds.select_dtypes("object").columns

Setting Up Variables for Regression

The code below creates a variable X_iterated: a DataFrame containing the column with the strongest correlation and your selected categorical feature.

# Run this cell without changes
X_iterated = diamonds[[most_correlated, cat_col]]
X_iterated

Preprocessing Categorical Variable

If we tried to pass X_iterated as-is into sm.OLS, we would get an error. We need to use pd.get_dummies to create dummy variables for cat_col.

DO NOT use drop_first=True, so that you can intentionally set a meaningful reference category instead.

# Replace None with appropriate code

# Use pd.get_dummies to one-hot encode the categorical column in X_iterated
X_iterated = None
X_iterated
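
A minimal sketch; passing columns=[cat_col] one-hot encodes only that column and leaves the numeric predictor untouched:

# One-hot encode the categorical column, keeping all categories
X_iterated = pd.get_dummies(X_iterated, columns=[cat_col])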

The following code checks that you have the right number of columns:

# Run this cell without changes

# X_iterated should be a dataframe
assert type(X_iterated) == pd.DataFrame

# You should have the number of unique values in one of the
# categorical columns + 1 (representing the numeric predictor)
valid_col_nums = diamonds.select_dtypes("object").nunique() + 1

# Check that there are the correct number of columns
# (if this crashes, make sure you did not use `drop_first=True`)
assert X_iterated.shape[1] in valid_col_nums.values

Now, applying your domain understanding, choose a dummy column to drop and drop it. The dropped category should make sense as a "baseline" or "reference" that the remaining coefficients are compared against. Also note that if you chose cut, pd.get_dummies generated a "cut_Very Good" column; rename it so the column name contains no space. One possible approach is sketched after the code cell below.

# Your code here
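
For example, if you chose cut, "Fair" (the lowest cut quality) is a sensible reference category, and only the "cut_Very Good" column needs renaming. This is just one reasonable choice:

# Drop the reference category; the remaining cut coefficients are
# then interpreted relative to a Fair cut
X_iterated = X_iterated.drop("cut_Fair", axis=1)

# Remove the space from the "cut_Very Good" column name
X_iterated = X_iterated.rename(columns={"cut_Very Good": "cut_Very_Good"})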

We also need to convert the boolean values in the dummy columns to 1s and 0s in order for the regression to run.

# Your code here
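
A sketch that casts only the boolean dummy columns, leaving the numeric predictor alone (recent pandas versions create dummies with bool dtype):

# Find the bool dummy columns and cast them to 0/1 integers
bool_cols = X_iterated.select_dtypes("bool").columns
X_iterated[bool_cols] = X_iterated[bool_cols].astype(int)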

Now you should have 1 fewer column than before:

# Run this cell without changes

# Check that there are the correct number of columns
assert X_iterated.shape[1] in (valid_col_nums - 1).values

5. Build a Multiple Linear Regression Model

Using the y variable from our previous model and X_iterated, build a model called iterated_model and a regression results object called iterated_results.

# Your code here
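
This mirrors the baseline setup, swapping in X_iterated:

# Same pattern as the baseline: add a constant, then fit OLS
iterated_model = sm.OLS(y, sm.add_constant(X_iterated))
iterated_results = iterated_model.fit()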

6. Evaluate and Interpret Multiple Linear Regression Model Results

If the model was set up correctly, the following code will print the results summary.

# Run this cell without changes
print(iterated_results.summary())

Summarize your findings below. How did the iterated model perform overall? How does this compare to the baseline model? What do the coefficients mean?

Create as many additional cells as needed.
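
When comparing the two models, adjusted R-squared is a fairer yardstick than plain R-squared, since the iterated model uses more predictors. A quick side-by-side sketch:

# rsquared_adj penalizes the extra dummy columns
print("Baseline adj. R-squared:", baseline_results.rsquared_adj)
print("Iterated adj. R-squared:", iterated_results.rsquared_adj)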

# Your written answer here

Summary

Congratulations, you completed an iterative linear regression process! You practiced developing a baseline and an iterated model, as well as identifying promising predictors from both numeric and categorical features.