/pricing-midrange-homes

A multiple linear regression project

Primary LanguageJupyter Notebook

Pricing Midrange Homes in King County, WA

In recent years King County, WA, has seen cost of living climb well above state and national averages, and the high cost of housing is a major factor in this increase. A news article published by the University of Washington in 2017 reported that the cost of basic living essentials for a family of four in Seattle had increased by $30,000 (62%) over the previous decade, with housing and childcare eating up half of a family's budget.

For middle-income families, home ownership can be a path to building wealth. How can a family maximize the return on their investment in home ownership? What factors affect the selling prices of midrange homes? What can homeowners do to increase the value of their homes while they occupy them?

This project addresses these questions using data from the King County Housing dataset. To make the results more relevant to middle-income families, I focus on houses with two to five bedrooms, omitting the top 10% most expensive houses; the houses that remain have a median price of $425,000.* The resulting multiple linear regression model can help middle-income families understand how changes within their reach, like adding a bedroom or remodeling a bathroom, can increase the sale price of their homes.

*Assuming excellent credit, a down payment of 14%, and the ability to spend about 1/3 of monthly gross income on a mortgage payment, $425,000 would be an affordable home price for a family making King County's median household income of $89,675. I used NerdWallet's mortgage calculator to estimate this.
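For transparency, here is a rough back-of-the-envelope version of that affordability check. It is only a sketch: the 4% interest rate and 30-year term are my own illustrative assumptions, and property taxes, insurance, and PMI are ignored (NerdWallet's calculator accounts for more than this).

# Rough affordability sanity check (illustrative only; rate and term are assumed)
income = 89_675            # King County median household income
price = 425_000
loan = price * (1 - 0.14)  # 14% down payment
rate = 0.04 / 12           # assumed 4% annual rate, paid monthly
n = 30 * 12                # assumed 30-year fixed mortgage

payment = loan * rate / (1 - (1 + rate) ** -n)
budget = income / 12 / 3   # about 1/3 of monthly gross income
print(f'Estimated monthly payment: ${payment:,.0f} vs. budget: ${budget:,.0f}')

With those assumptions, the estimated principal-and-interest payment comes in below the 1/3-of-income budget, leaving some room for taxes and insurance, which is roughly consistent with the $425,000 figure above.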

Summary of findings

(see Interpretation below for deeper discussion)

  • Adding a bathroom or increasing square footage by 100 square feet could raise the price of a home by over $8,300.
  • Improving the condition of a home by one level (e.g., from 3 to 4) could raise the price of the home by over $21,000.
  • Improving the grade of a home by one level (e.g., from 7 to 8) could raise the price of the home by over $42,000.

My model explains 82% of the variability in home prices.

About this repo

Here's what you'll find in this repository:

  • pricing-midrange-homes-king-co-wa.ipynb: Jupyter Notebook containing project code
  • column_names.md: Markdown file containing the data dictionary for the dataset provided by Flatiron School
  • kc_house_data.csv: CSV file containing the dataset provided by Flatiron School
  • presentation.pdf: PDF file containing slides from a non-technical presentation of project results

For more technical details, see the Appendix below.

Exploring the whole dataset

To get a sense of what is in the King County Housing dataset, I will first do some basic exploration of the entire dataset. After cleaning, another round of exploration will help clarify trends in the data specific to midrange housing.

First, I import the packages and data I need and take a look at the first few rows to see what's there.

# Import needed packages and read in the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data = pd.read_csv('kc_house_data.csv')
# View first few rows of the dataset
data.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 ... 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
1 6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 ... 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0.0 0.0 ... 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
3 2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 ... 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
4 1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 ... 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

Already I can see that there is some data that may not be useful at all. id appears to be a unique identifying number for each sale record. sqft_living appears to be the sum of sqft_above and sqft_basement. I can also see some missing values.

Let's check for missing values throughout the dataset.

Are there any missing values or duplicate observations?

Missing values can cause problems when it comes time to build my regression model, so I want to be sure to find and resolve them. If there are any duplicate observations, I will want to remove those.

First, let's check for missing values in each column.

# View counts and data types by column
data.info()
<class 'pandas.core.frame.DataFrame'>  
RangeIndex: 21597 entries, 0 to 21596  
Data columns (total 21 columns):  
id               21597 non-null int64  
date             21597 non-null object  
price            21597 non-null float64  
bedrooms         21597 non-null int64  
bathrooms        21597 non-null float64  
sqft_living      21597 non-null int64  
sqft_lot         21597 non-null int64  
floors           21597 non-null float64  
waterfront       19221 non-null float64  
view             21534 non-null float64  
condition        21597 non-null int64  
grade            21597 non-null int64  
sqft_above       21597 non-null int64  
sqft_basement    21597 non-null object  
yr_built         21597 non-null int64  
yr_renovated     17755 non-null float64  
zipcode          21597 non-null int64  
lat              21597 non-null float64  
long             21597 non-null float64  
sqft_living15    21597 non-null int64  
sqft_lot15       21597 non-null int64  
dtypes: float64(8), int64(11), object(2)  
memory usage: 3.5+ MB

Above I can see that there are a few columns with missing values (and some whose data types may need to be changed).

There's a more direct way to see how many missing values there are and where:

# Check for missing values by column
data.isna().sum()
id                  0  
date                0  
price               0  
bedrooms            0  
bathrooms           0  
sqft_living         0  
sqft_lot            0  
floors              0  
waterfront       2376  
view               63  
condition           0  
grade               0  
sqft_above          0  
sqft_basement       0  
yr_built            0  
yr_renovated     3842  
zipcode             0  
lat                 0  
long                0  
sqft_living15       0  
sqft_lot15          0  
dtype: int64

It looks like there are a lot of missing values in waterfront and yr_renovated and a few in view. These can be filled during the cleaning phase.

Let's check for duplicate observations. I'll do this two ways: first by searching for observations that are completely identical, and then by searching for duplicate values in the id column, since I believe that to be a unique identifier for each home sale.

# Check for duplicate records
print('Number of duplicate records: ', sum(data.duplicated()))
Number of duplicate records:  0
# Check for duplicate IDs
display(data['id'].value_counts().head())

# Count non-unique IDs
id_value_counts = data['id'].value_counts()
num_repeat_ids = id_value_counts[id_value_counts > 1].sum()  # total records whose ID appears more than once
print('Number of non-unique IDs: ', num_repeat_ids)
795000620     3  
1825069031    2  
2019200220    2  
7129304540    2  
1781500435    2  
Name: id, dtype: int64


Number of non-unique IDs:  353

I can see both from the value counts and from my calculation that there are a number of non-unique values in the id column, but these are few in relation to the overall size of the dataset.

If the records are not identical, and just the ids are repeated, how are these records different from one another?

# Inspect a few of the records with duplicate IDs
display(data[data['id'] == 795000620])
display(data[data['id'] == 1825069031])
# display(data[data['id'] == 2019200220])
# display(data[data['id'] == 7129304540])
# display(data[data['id'] == 1781500435])
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
17588 795000620 9/24/2014 115000.0 3 1.0 1080 6250 1.0 0.0 0.0 ... 5 1080 0.0 1950 0.0 98168 47.5045 -122.33 1070 6250
17589 795000620 12/15/2014 124000.0 3 1.0 1080 6250 1.0 0.0 0.0 ... 5 1080 0.0 1950 0.0 98168 47.5045 -122.33 1070 6250
17590 795000620 3/11/2015 157000.0 3 1.0 1080 6250 1.0 NaN 0.0 ... 5 1080 0.0 1950 NaN 98168 47.5045 -122.33 1070 6250

3 rows × 21 columns

id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
3946 1825069031 8/14/2014 550000.0 4 1.75 2410 8447 2.0 0.0 3.0 ... 8 2060 350.0 1936 1980.0 98074 47.6499 -122.088 2520 14789
3947 1825069031 10/16/2014 550000.0 4 1.75 2410 8447 2.0 NaN 3.0 ... 8 2060 350.0 1936 1980.0 98074 47.6499 -122.088 2520 14789

2 rows × 21 columns

On inspection, I can see that the observations really are different even when the id value is the same: different dates, different prices, etc. Since I expect to drop the id column before modeling anyway, I will leave these observations in the dataset and assume that they represent different sales of the same house or some other meaningful difference.
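As a quick follow-up sketch (not in the original notebook), all of the repeated-ID records can be viewed together to confirm that this pattern holds beyond the two IDs inspected above:

# Sort all records whose ID appears more than once and compare dates and prices
repeats = data[data['id'].duplicated(keep=False)].sort_values(['id', 'date'])
repeats[['id', 'date', 'price']].head(10)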

Are any of the variables categorical?

Categorical values can interfere with modeling if they are not handled correctly, so it is important to identify them early on. I'll use four approaches: viewing summary statistics, counting unique values, looking at histograms, and looking at scatter plots.

First, let's view summary statistics for each of the numeric columns and see if any of them have particularly small ranges, perhaps made up of mostly integers. (This is also a good opportunity to get a feel for the distributions of all the numeric variables.)

# Check summary statistics of numeric columns
data.describe()
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.159700e+04 2.159700e+04 21597.000000 21597.000000 21597.000000 2.159700e+04 21597.000000 19221.000000 21534.000000 21597.000000 21597.000000 21597.000000 21597.000000 17755.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000
mean 4.580474e+09 5.402966e+05 3.373200 2.115826 2080.321850 1.509941e+04 1.494096 0.007596 0.233863 3.409825 7.657915 1788.596842 1970.999676 83.636778 98077.951845 47.560093 -122.213982 1986.620318 12758.283512
std 2.876736e+09 3.673681e+05 0.926299 0.768984 918.106125 4.141264e+04 0.539683 0.086825 0.765686 0.650546 1.173200 827.759761 29.375234 399.946414 53.513072 0.138552 0.140724 685.230472 27274.441950
min 1.000102e+06 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 3.000000 370.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 2.123049e+09 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 1951.000000 0.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000
50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 1975.000000 0.000000 98065.000000 47.571800 -122.231000 1840.000000 7620.000000
75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 1.068500e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000

It is clear from the results that there are some wide ranges in some columns. price, for example, ranges from $78,000 to over $7 million, and sqft_living from 370 to over 13,000 square feet. Other variables could be categorical, such as view and condition, where there seem to be few unique values representing levels of quality rather than some truly quantitative measure.

Let's check the number of unique values in each column to see if we find any other potential categorical variables.

# Check the number of unique values in each column
unique_vals_list = []
for col in data.columns:
    unique_vals_list.append({'column': col, 'unique values': len(data[col].unique())})
pd.DataFrame(unique_vals_list)
column unique values
0 id 21420
1 date 372
2 price 3622
3 bedrooms 12
4 bathrooms 29
5 sqft_living 1034
6 sqft_lot 9776
7 floors 6
8 waterfront 3
9 view 6
10 condition 5
11 grade 11
12 sqft_above 942
13 sqft_basement 304
14 yr_built 116
15 yr_renovated 71
16 zipcode 70
17 lat 5033
18 long 751
19 sqft_living15 777
20 sqft_lot15 8682

It is clear that floors, waterfront, view, and condition have few unique values (and we have already seen that some of those values are missing anyway). bedrooms and grade might also be good candidates for one-hot encoding after we have narrowed the dataset to just the midrange homes.

Now let's check the histograms:

# Define a function to create histograms
def hist_it(data):
    
    """Creates histograms of all numeric columns in a DataFrame"""
    
    data.hist(figsize=(16,14))
# Create histograms for numerical variables
data_for_hist = data.drop(['id'], axis=1)

hist_it(data_for_hist)

Let's also take a look at scatter plots for each numeric variable. Conveniently, the scatter matrix below also lets us do an initial check for linear relationships between the independent variables and the target of our analysis, price.

# Create a scatter_matrix to look for categorical variables and linear relationships
pd.plotting.scatter_matrix(data.drop(['id'], axis=1), figsize=[16,14]);

Based on the plots, value counts, and summary statistics, I think that floors, waterfront, view, condition, and grade may all be candidates for one-hot encoding in the final dataset. There is also zipcode to consider; although there are many unique values in that column, a home's zipcode is more like a category it belongs to than a quantitative measure of some kind.

I choose to continue to treat all variables as continuous for the first round of modeling, but below I will create a copy of my dataset with zipcode one-hot encoded to try as well.

Are any of the independent variables strongly correlated with each other?

Multicollinearity (when two or more independent variables are strongly correlated with each other) inflates the variance of the coefficient estimates and makes the individual effects of the correlated features hard to interpret, so a model can look more trustworthy than it really is. If such correlations are found, one variable from each correlated pair can be dropped from the model to counteract the effect of multicollinearity.

I will check for correlations among the features using a correlation matrix and a heatmap, below, and using 0.8 as the threshold for "strong" correlation.

# Create a boolean correlation matrix to highlight strong correlations
abs(data.corr()) > 0.8
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
id True False False False False False False False False False False False False False False False False False False
price False True False False False False False False False False False False False False False False False False False
bedrooms False False True False False False False False False False False False False False False False False False False
bathrooms False False False True False False False False False False False False False False False False False False False
sqft_living False False False False True False False False False False False True False False False False False False False
sqft_lot False False False False False True False False False False False False False False False False False False False
floors False False False False False False True False False False False False False False False False False False False
waterfront False False False False False False False True False False False False False False False False False False False
view False False False False False False False False True False False False False False False False False False False
condition False False False False False False False False False True False False False False False False False False False
grade False False False False False False False False False False True False False False False False False False False
sqft_above False False False False True False False False False False False True False False False False False False False
yr_built False False False False False False False False False False False False True False False False False False False
yr_renovated False False False False False False False False False False False False False True False False False False False
zipcode False False False False False False False False False False False False False False True False False False False
lat False False False False False False False False False False False False False False False True False False False
long False False False False False False False False False False False False False False False False True False False
sqft_living15 False False False False False False False False False False False False False False False False False True False
sqft_lot15 False False False False False False False False False False False False False False False False False False True
# Create the correlation heatmap
data_for_scatter_matrix = data.drop(['price', 'id'], axis=1)

plt.figure(figsize=(16,10))
sns.heatmap(data_for_scatter_matrix.corr(), center=0)
plt.title('Heatmap showing correlations between independent variables', fontsize=14)
plt.show();

The only pair of features with a correlation coefficient of 0.8 or above is sqft_living and sqft_above. As stated earlier, sqft_living is the sum of sqft_above and sqft_basement, so sqft_above will be dropped from the dataset anyway, avoiding any potential effects caused by multicollinearity.
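Pairwise correlation is not the only way to check for multicollinearity. As an optional cross-check (not part of the original notebook), variance inflation factors can flag features whose information is largely captured by combinations of the other features; values well above 10 are usually considered problematic.

# Optional cross-check: variance inflation factors for the numeric features
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Numeric columns only, with rows containing missing values dropped for this check
X = sm.add_constant(data.drop(['price', 'id'], axis=1).select_dtypes('number').dropna())
vifs = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                 index=X.columns)
vifs.drop('const').sort_values(ascending=False).head()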

Initial cleaning

Now that we have seen the basic characteristics of the whole dataset, it is time to trim it down to the part that is most relevant to middle-income families: houses with 2 to 5 bedrooms and not in the top 10% by price. This is also the time to resolve any missing values, drop unneeded columns, and then reexamine the data visually to understand the characteristics of our more focused dataset.

These are the main cleaning steps I will take:

  • Filter the dataset to observations with 2-5 bedrooms.
  • Filter the dataset to observations with price below the 90th percentile.
  • Resolve missing values in the waterfront, view, and yr_renovated columns.
  • Calculate a new column, month, from the date column.
  • Drop the following columns: id, date, sqft_above, sqft_basement.

Filter to focus on midrange homes

First, let's focus in on the midrange homes by filtering out the biggest and most expensive ones.

# Filter the dataset
midrange_homes = data[(data['price'] < np.quantile(data['price'], 0.9)) 
                      & (data['bedrooms'].isin(range(2, 6)))].copy()  # .copy() avoids SettingWithCopyWarning on later assignments

# View the first few rows
midrange_homes.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 ... 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
1 6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 ... 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0.0 0.0 ... 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
3 2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 ... 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
4 1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 ... 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

Let's check the effect of our filtering by looking at the summary statistics again:

# View summary statistics
midrange_homes.describe()
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 1.899900e+04 18999.000000 18999.000000 18999.000000 18999.000000 1.899900e+04 18999.000000 16906.000000 18943.000000 18999.000000 18999.000000 18999.000000 18999.000000 15614.000000 18999.000000 18999.000000 18999.000000 18999.000000 18999.000000
mean 4.623793e+09 451033.600453 3.285910 2.012816 1912.888047 1.413348e+04 1.463682 0.002189 0.145331 3.405127 7.464498 1659.656298 1971.048424 66.947931 98078.543344 47.553219 -122.212729 1882.645929 12188.540871
std 2.878978e+09 177148.927445 0.782848 0.672443 706.241075 3.852339e+04 0.536461 0.046732 0.582127 0.643411 0.959870 671.456125 28.847826 359.290326 53.294178 0.143337 0.142935 574.261538 25473.821194
min 1.200019e+06 78000.000000 2.000000 0.500000 440.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 4.000000 440.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 2.154950e+09 310000.000000 3.000000 1.500000 1380.000000 5.000000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1160.000000 1952.000000 0.000000 98033.000000 47.451150 -122.330000 1460.000000 5040.000000
50% 3.943600e+09 425000.000000 3.000000 2.000000 1810.000000 7.486000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1480.000000 1975.000000 0.000000 98065.000000 47.562200 -122.233000 1780.000000 7515.000000
75% 7.350400e+09 570000.000000 4.000000 2.500000 2340.000000 1.015000e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2030.000000 1996.000000 0.000000 98118.000000 47.680700 -122.122000 2230.000000 9750.000000
max 9.900000e+09 886000.000000 5.000000 5.000000 5820.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 12.000000 5450.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 4670.000000 560617.000000

Bedrooms now range from 2 to 5, and the median price is now $425,000.

Let's check again for missing values; some may have been eliminated by the filtering.

# Check for missing values by column
midrange_homes.isna().sum()
id                  0  
date                0  
price               0  
bedrooms            0  
bathrooms           0  
sqft_living         0  
sqft_lot            0  
floors              0  
waterfront       2093  
view               56  
condition           0  
grade               0  
sqft_above          0  
sqft_basement       0  
yr_built            0  
yr_renovated     3385  
zipcode             0  
lat                 0  
long                0  
sqft_living15       0  
sqft_lot15          0  
dtype: int64

Resolve missing values

waterfront

I can see above that there are still 2,093 missing values in the waterfront column. I'll investigate further.

# View value counts for 'waterfront'
midrange_homes['waterfront'].value_counts()
0.0    16869  
1.0       37  
Name: waterfront, dtype: int64

Out of about 19,000 observations, only 37 have a value of 1 (presumably meaning "yes") in the waterfront column. If we take a quick look at median prices of homes with waterfront == 1 versus those where it equals 0...

# Print medians of homes with and without 'waterfront'
print(midrange_homes[midrange_homes['waterfront'] == 1]['price'].median())
print(midrange_homes[midrange_homes['waterfront'] == 0]['price'].median())
635000.0

425000.0

...it appears that waterfront location has a strong positive effect on price, but the vast majority of houses are not on the waterfront at all. For this reason, I will fill missing waterfront values with 0, assuming that when the creators of this dataset left a value blank, the property was probably not on the waterfront anyway.

# Fill NaNs with 0.0 because it is the mode
midrange_homes['waterfront'] = midrange_homes['waterfront'].fillna(0.0)
midrange_homes['waterfront'] = midrange_homes['waterfront'].astype('int64')
midrange_homes.info()
<class 'pandas.core.frame.DataFrame'>  
Int64Index: 18999 entries, 0 to 21596  
Data columns (total 21 columns):  
id               18999 non-null int64  
date             18999 non-null object  
price            18999 non-null float64  
bedrooms         18999 non-null int64  
bathrooms        18999 non-null float64  
sqft_living      18999 non-null int64  
sqft_lot         18999 non-null int64  
floors           18999 non-null float64  
waterfront       18999 non-null int64  
view             18943 non-null float64  
condition        18999 non-null int64  
grade            18999 non-null int64  
sqft_above       18999 non-null int64  
sqft_basement    18999 non-null object  
yr_built         18999 non-null int64  
yr_renovated     15614 non-null float64  
zipcode          18999 non-null int64  
lat              18999 non-null float64  
long             18999 non-null float64  
sqft_living15    18999 non-null int64  
sqft_lot15       18999 non-null int64  
dtypes: float64(7), int64(12), object(2)  
memory usage: 3.2+ MB

view

The documentation for this dataset is not very clear on what view represents. It seems likely, given what little has been said about it, that view gives some indication of whether or how much a home was viewed prior to sale. Because the maximum value of view is only 4, it seems likely that this is not an absolute count, but perhaps some kind of "index" or "tier" of how much viewing went on (i.e., a categorical variable). If we take a quick look at the distribution of values in view...

# Create a histogram of 'view' values
plt.figure(figsize=(10,6))
midrange_homes['view'].hist()
plt.title('Histogram of \'view\' values')
plt.xlabel('\'view\' values')
plt.ylabel('Count')
plt.show();

...we can see that 0 is by far the most common value. The summary statistics table above shows that 0 is also the minimum, the 25th percentile, the median, and the 75th percentile. For these reasons, I will fill all missing values with 0, since this will not meaningfully change the overall distribution of the data.

# Fill NaNs with 0.0 and check that missing `view` values are now resolved
midrange_homes['view'] = midrange_homes['view'].fillna(0.0).astype('int64')
midrange_homes.info()
<class 'pandas.core.frame.DataFrame'>  
Int64Index: 18999 entries, 0 to 21596  
Data columns (total 21 columns):  
id               18999 non-null int64  
date             18999 non-null object  
price            18999 non-null float64  
bedrooms         18999 non-null int64  
bathrooms        18999 non-null float64  
sqft_living      18999 non-null int64  
sqft_lot         18999 non-null int64  
floors           18999 non-null float64  
waterfront       18999 non-null int64  
view             18999 non-null int64  
condition        18999 non-null int64  
grade            18999 non-null int64  
sqft_above       18999 non-null int64  
sqft_basement    18999 non-null object  
yr_built         18999 non-null int64  
yr_renovated     15614 non-null float64  
zipcode          18999 non-null int64  
lat              18999 non-null float64  
long             18999 non-null float64  
sqft_living15    18999 non-null int64  
sqft_lot15       18999 non-null int64  
dtypes: float64(6), int64(13), object(2)  
memory usage: 3.2+ MB

yr_renovated

From the histogram below, it is clear that most yr_renovated values are 0. I assume this means that a house has not had substantial renovation since its construction.

# Create a histogram of 'yr_renovated' values
plt.figure(figsize=(10,6))
midrange_homes['yr_renovated'].hist()
plt.title('Histogram of \'yr_renovated\' values')
plt.xlabel('\'yr_renovated\' values')
plt.ylabel('Count')
plt.show();

Since 0 is the mode, I will replace the missing values with 0:

# Fill NaNs with 0 and cast to dtype 'int64'
midrange_homes['yr_renovated'] = midrange_homes['yr_renovated'].fillna(0)
midrange_homes['yr_renovated'] = midrange_homes['yr_renovated'].astype('int64')

# Confirm that all missing values have been resolved
midrange_homes.info()
<class 'pandas.core.frame.DataFrame'>  
Int64Index: 18999 entries, 0 to 21596  
Data columns (total 21 columns):  
id               18999 non-null int64  
date             18999 non-null object  
price            18999 non-null float64  
bedrooms         18999 non-null int64  
bathrooms        18999 non-null float64  
sqft_living      18999 non-null int64  
sqft_lot         18999 non-null int64  
floors           18999 non-null float64  
waterfront       18999 non-null int64  
view             18999 non-null int64  
condition        18999 non-null int64  
grade            18999 non-null int64  
sqft_above       18999 non-null int64  
sqft_basement    18999 non-null object  
yr_built         18999 non-null int64  
yr_renovated     18999 non-null int64  
zipcode          18999 non-null int64  
lat              18999 non-null float64  
long             18999 non-null float64  
sqft_living15    18999 non-null int64  
sqft_lot15       18999 non-null int64  
dtypes: float64(5), int64(14), object(2)  
memory usage: 3.2+ MB

Create the 'month' column

I want to have the option to explore seasonality in sale prices. Since the dataset comes from 2014-2015 and I want to be able to generalize to other years, I will focus on the month in which each sale was made. I will extract the number of the month from each date into a new column.

# Create 'month' column
midrange_homes['month'] = [x.split('/')[0] for x in midrange_homes['date']]
midrange_homes['month'] = pd.to_numeric(midrange_homes['month'], 
                                        errors='coerce')

# Confirm that 'month' exists and is numeric
midrange_homes.info()
<class 'pandas.core.frame.DataFrame'>  
Int64Index: 18999 entries, 0 to 21596  
Data columns (total 22 columns):  
id               18999 non-null int64  
date             18999 non-null object  
price            18999 non-null float64  
bedrooms         18999 non-null int64  
bathrooms        18999 non-null float64  
sqft_living      18999 non-null int64  
sqft_lot         18999 non-null int64  
floors           18999 non-null float64  
waterfront       18999 non-null int64  
view             18999 non-null int64  
condition        18999 non-null int64  
grade            18999 non-null int64  
sqft_above       18999 non-null int64  
sqft_basement    18999 non-null object  
yr_built         18999 non-null int64  
yr_renovated     18999 non-null int64  
zipcode          18999 non-null int64  
lat              18999 non-null float64  
long             18999 non-null float64  
sqft_living15    18999 non-null int64  
sqft_lot15       18999 non-null int64  
month            18999 non-null int64  
dtypes: float64(5), int64(15), object(2)  
memory usage: 3.3+ MB

Drop unneeded columns

I have decided to drop the following columns for the reasons given below:

  • id: The values in this column simply identify each observation, and I don't expect them to be useful for modeling.

  • date: The whole date is not analytically interesting to me, but the month is, so I extracted that data to a separate column.

  • sqft_above and sqft_basement: As stated above, sqft_living appears to be the sum of these two values. Since there is not much a homeowner could do to shift the balance of 'above' space to basement, I don't expect these features to be useful for my model.

# Drop unneeded columns
midrange_homes.drop(['id', 'date', 'sqft_above', 'sqft_basement'], 
                    axis=1, inplace=True)

# Review the remaining columns
midrange_homes.info()
<class 'pandas.core.frame.DataFrame'>  
Int64Index: 18999 entries, 0 to 21596  
Data columns (total 18 columns):  
price            18999 non-null float64  
bedrooms         18999 non-null int64  
bathrooms        18999 non-null float64  
sqft_living      18999 non-null int64  
sqft_lot         18999 non-null int64  
floors           18999 non-null float64  
waterfront       18999 non-null int64  
view             18999 non-null int64  
condition        18999 non-null int64  
grade            18999 non-null int64  
yr_built         18999 non-null int64  
yr_renovated     18999 non-null int64  
zipcode          18999 non-null int64  
lat              18999 non-null float64  
long             18999 non-null float64  
sqft_living15    18999 non-null int64  
sqft_lot15       18999 non-null int64  
month            18999 non-null int64  
dtypes: float64(5), int64(13)  
memory usage: 2.8 MB

A look at the .info() above shows that we now have 18,999 rows and 18 columns. All columns are of numeric data types, and there are no missing values.

Exploring just the midrange homes

General exploration

Let's revisit our histograms now that the data has been cleaned.

# Create histograms of numeric features
hist_it(midrange_homes)

Note that selecting the homes with lower prices and 2-5 bedrooms has also caused the ranges of some other variables to shrink. grade, for instance, now tops out at 12 (down from 13), sqft_living at about 5,800 square feet (down from over 13,000), and bathrooms at 5 (down from 8).

Now let's examine scatterplots of each feature against price:

# Create scatterplots of all columns v. price
for column in midrange_homes.drop('price', axis=1):
    midrange_homes.plot(kind='scatter', x=column, y='price', alpha=0.3)
    plt.show();

Most features appear to have a more or less linear relationship with price. The distribution of points in zipcode makes me wonder if there are actually two subsets in this data: homes in the Seattle area (with zipcodes of 98100 and up), and homes elsewhere in the county. Depending on the results of my first attempt at modeling, I may decide to reexamine the effect that location may be having on the model.
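To put a quick number on that apparent split (a small sketch, using 98100 as a rough cutoff for Seattle proper), we can compare median prices on either side of it:

# Compare median prices for Seattle-area zipcodes (>= 98100) vs. the rest of the county
in_seattle = midrange_homes['zipcode'] >= 98100
midrange_homes.groupby(in_seattle)['price'].median()

Whatever the size of the gap, it is another reason to try treating zipcode as a categorical variable later on.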

Now that the dataset has been trimmed down a bit, let's re-check for multicollinearity.

# Create a boolean correlation matrix to highlight strong correlations
abs(midrange_homes.corr()) > 0.8
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 month
price True False False False False False False False False False False False False False False False False False
bedrooms False True False False False False False False False False False False False False False False False False
bathrooms False False True False False False False False False False False False False False False False False False
sqft_living False False False True False False False False False False False False False False False False False False
sqft_lot False False False False True False False False False False False False False False False False False False
floors False False False False False True False False False False False False False False False False False False
waterfront False False False False False False True False False False False False False False False False False False
view False False False False False False False True False False False False False False False False False False
condition False False False False False False False False True False False False False False False False False False
grade False False False False False False False False False True False False False False False False False False
yr_built False False False False False False False False False False True False False False False False False False
yr_renovated False False False False False False False False False False False True False False False False False False
zipcode False False False False False False False False False False False False True False False False False False
lat False False False False False False False False False False False False False True False False False False
long False False False False False False False False False False False False False False True False False False
sqft_living15 False False False False False False False False False False False False False False False True False False
sqft_lot15 False False False False False False False False False False False False False False False False True False
month False False False False False False False False False False False False False False False False False True
# Create the correlation heatmap
data_for_scatter_matrix = midrange_homes.drop(['price'], axis=1)

plt.figure(figsize=(16,10))
sns.heatmap(data_for_scatter_matrix.corr(), center=0)
plt.title('Heatmap showing correlations between independent variables', 
          fontsize=18)
plt.show();

Good news: there appear to be no correlations with coefficients higher than my chosen threshold of 0.8.

Specific questions

Having seen the histograms and scatterplots, I now want to examine a few features in detail to answer the following questions:

  • Is there a number of bedrooms that causes a big jump up in price (e.g., three versus two, or four versus three)?

  • Are there times of year when prices tend to be lower or higher?

  • What is the difference in price between a home of a middling grade (e.g., 7) and one of a higher grade (e.g., 10)?

What is the relationship between bedrooms and price?

Is there a number of bedrooms that causes a big jump up in price (e.g., three versus two, or four versus three)? Let's examine some boxplots to find out.

# Create boxplots for 'bedrooms' v. 'price'
plt.figure(figsize=(10,8))
sns.boxplot(midrange_homes['bedrooms'], midrange_homes['price'], color='skyblue')
plt.title('Distributions of prices for each number of bedrooms', fontsize=18)
plt.xlabel('Number of bedrooms')
plt.ylabel('Price (USD)')
plt.show();

# Calculate percent differences in median prices
medians = []

for n in range(2,6):
    medians.append(midrange_homes[midrange_homes['bedrooms'] == n]['price'].median())

percent_differences = []
for m in range(0,len(medians)-1):
    percent_differences.append(round(((medians[m+1] - medians[m]) / medians[m]),2))
    
percent_differences   
[0.08, 0.22, 0.07]

The biggest difference in median price is between three and four bedrooms: going from three to four is associated with a 22% increase in median price.

Are there seasonal trends in price?

Are there times of year when prices tend to be lower or higher?

# Create a boxplot to compare 'month' and 'price'
plt.figure(figsize=(10,8))
sns.boxplot(midrange_homes['month'], midrange_homes['price'], color='skyblue')
plt.title('Distributions of prices for each month', fontsize=18)
plt.xlabel('Month (January = 1)')
plt.ylabel('Price (USD)')
plt.show();

Although houses sell at a wide range of prices every month, median prices appear to be higher in April through July than in November through February.

Let's compare April and January median prices directly:

april_med = midrange_homes[midrange_homes['month'] == 4]['price'].median()
january_med = midrange_homes[midrange_homes['month'] == 1]['price'].median()

april_med - january_med
41000.0

There is a difference in median house price of $41,000 between April and January! Although this analysis doesn't consider any other factors (like whether houses sold in April are bigger than those sold in January), there could be an advantage to timing the sale of one's home for the spring/summer as opposed to winter.
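For a fuller numeric view of the seasonal pattern, here is a quick complementary sketch that computes the median sale price for every month at once:

# Median sale price by month of sale
midrange_homes.groupby('month')['price'].median()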

What is the relationship between grade and price?

What is the difference in price between a home of a middling grade (e.g., 7) and one of a higher grade (e.g., 10)?

# Create boxplots to compare 'grade' and 'price'
plt.figure(figsize=(10,8))
sns.boxplot(midrange_homes['grade'], midrange_homes['price'], color='skyblue')
plt.title('Distributions of prices for each grade', fontsize=18)
plt.xlabel('Grade')
plt.ylabel('Price (USD)')
plt.show();

It looks like there could be substantial differences in price based on the grade of a house. For instance, only the outliers of grade-5 houses fall within the price range of grade-11 houses.

Let's make a direct comparison between the median prices of grade-7 and grade-10 homes:

grade_7_med = midrange_homes[midrange_homes['grade'] == 7]['price'].median()
grade_10_med = midrange_homes[midrange_homes['grade'] == 10]['price'].median()

grade_10_med - grade_7_med
378805.0

There is a huge difference (almost $379,000) between the median prices of grade-7 and grade-10 homes. Improving the grade of a home by that much is probably outside the reach of most homeowners. What if a homeowner could improve the grade of their home from 7 to 8?

grade_8_med = midrange_homes[midrange_homes['grade'] == 8]['price'].median()

grade_8_med - grade_7_med
126500.0

The boxplots above show that the jump in median price from grade 7 to grade 8 is a big one; if a homeowner could manage such an improvement, it could pay off. The median price of a grade-8 home is $126,500 higher than the median price of a grade-7 home. Again, this is without considering any other factors, like the size or condition of these homes.

Where are the midrange homes in King County?

Because King County is home to Seattle, one of the most expensive housing markets in the US, there is probably some spatial patterning in home price, size, and quality across the county. Although a home-owning family probably cannot move their house to a new location, a family shopping for a home would want to know where they might find homes in their price range, or how much more house they could get by living a little farther from central Seattle.

Let's use lat and long to take a look at where midrange homes can be found in King County.

# Define a function to create map-like scatter plots with color code
def locate_it(data, latitude, longitude, feature):
    
    """Create a scatterplot from lat/long data with color code.
    Parameters:
        data: a DataFrame
        latitude: the name of the column in your DataFrame that contains
            the latitude values. Pass this as a string.
        longitude: the name of the column in your DataFrame that contains
            the longitude values. Pass this as a string.
        feature: the name of the column whose values you want to use as 
            the values for your color code. Pass this as a string.
    Dependencies: matplotlib
    Returns: scatterplot"""
    
    plt.figure(figsize=(16,12))
    cmap = plt.cm.get_cmap('RdYlBu')
    sc = plt.scatter(data[longitude], data[latitude], 
                     c=data[feature], vmin=min(data[feature]), 
                     vmax=max(data[feature]), alpha=0.5, s=5, cmap=cmap)
    plt.colorbar(sc)
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.title('House {} by location'.format(feature), fontsize=18)
    plt.show();
# Call locate_it for price by location
locate_it(midrange_homes, 'lat', 'long', 'price')

This scatterplot confirms that the most expensive midrange homes (blue dots) can be found in Seattle, around Lake Washington, and in some other waterfront locations. Less expensive homes (red dots) are concentrated in the southwest part of the county.

# Call locate_it for sqft_living by location
locate_it(midrange_homes, 'lat', 'long', 'sqft_living')

Midrange homes tend to be smaller, which is why there are so few blue dots on this plot. It's hard to understand the range of home sizes represented here, so I'll plot it again with some adjustments to the color scale.

# Customize the plot for sqft_living by location
plt.figure(figsize=(16,12))
cmap = plt.cm.get_cmap('RdYlBu')
sc = plt.scatter(midrange_homes['long'], midrange_homes['lat'], 
                 c=midrange_homes['sqft_living'], 
                 vmin=min(midrange_homes['sqft_living']), 
                 vmax=np.percentile(midrange_homes['sqft_living'], 90), 
                 alpha=0.5, s=5, cmap=cmap)
plt.colorbar(sc)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('House square footage by location\n(Darkest blue = 90th percentile of size)', fontsize=14)
plt.show();

That's a little easier to understand. Midrange homes over 2,000 square feet can be found across much of western King County except in Seattle proper, where they are less common.

What about house quality? Let's visualize homes by grade.

# Call locate_it for grade by location
locate_it(midrange_homes, 'lat', 'long', 'grade')

This plot shows that the highest quality midrange homes by grade can be found to the east of Lake Washington, while there seem to be more low-quality homes in the central and southwestern parts of the county. Overall, midrange homes tend to have grades in the middle of the range of possible values, from 6 to 10.

Finally, let's look at the condition of midrange homes.

locate_it(midrange_homes, 'lat', 'long', 'condition')

This plot contains some good news for middle-income home buyers: homes in good condition (blue dots) can be found all over King County, and midrange homes are almost all of middling condition or better.

Final preprocessing

I anticipate that I may want to examine the effect that location (expressed through zipcode) has on price, so I'll create a copy of my midrange_homes DataFrame with dummy variables for the various zipcodes.

# Generate dummy variables
zip_dummies = pd.get_dummies(midrange_homes['zipcode'], prefix='zip')
# Drop the original 'zipcode' column
mh_no_zips = midrange_homes.drop('zipcode', axis=1)

# Concatenate the dummies to the copy of 'midrange_homes'
mh_zips_encoded = pd.concat([mh_no_zips, zip_dummies], axis=1)

# Preview the new DataFrame
mh_zips_encoded.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade ... zip_98146 zip_98148 zip_98155 zip_98166 zip_98168 zip_98177 zip_98178 zip_98188 zip_98198 zip_98199
0 221900.0 3 1.00 1180 5650 1.0 0 0 3 7 ... 0 0 0 0 0 0 1 0 0 0
1 538000.0 3 2.25 2570 7242 2.0 0 0 3 7 ... 0 0 0 0 0 0 0 0 0 0
2 180000.0 2 1.00 770 10000 1.0 0 0 3 6 ... 0 0 0 0 0 0 0 0 0 0
3 604000.0 4 3.00 1960 5000 1.0 0 0 5 7 ... 0 0 0 0 0 0 0 0 0 0
4 510000.0 3 2.00 1680 8080 1.0 0 0 3 8 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 87 columns

# Drop one of the dummy variables
mh_zips_encoded.drop('zip_98168', axis=1, inplace=True)

# Check the head again
mh_zips_encoded.head()
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade ... zip_98144 zip_98146 zip_98148 zip_98155 zip_98166 zip_98177 zip_98178 zip_98188 zip_98198 zip_98199
0 221900.0 3 1.00 1180 5650 1.0 0 0 3 7 ... 0 0 0 0 0 0 1 0 0 0
1 538000.0 3 2.25 2570 7242 2.0 0 0 3 7 ... 0 0 0 0 0 0 0 0 0 0
2 180000.0 2 1.00 770 10000 1.0 0 0 3 6 ... 0 0 0 0 0 0 0 0 0 0
3 604000.0 4 3.00 1960 5000 1.0 0 0 5 7 ... 0 0 0 0 0 0 0 0 0 0
4 510000.0 3 2.00 1680 8080 1.0 0 0 3 8 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 86 columns

Once I have seen how my model performs on midrange_homes, I can try again with mh_zips_encoded to see if treating zipcode as a categorical variable improves model performance.

Creating the model

First model

Working on the data in midrange_homes, I will use multiple linear regression to try to create a model that explains at least 70% of the variability in price.

# Import statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Split the cleaned data into features and target
mh_features = midrange_homes.drop(['price'], axis=1)
mh_target = midrange_homes['price']
# Define a function to run OLS and return model summary
def model_it(data, features):
    
    """Fit an OLS model and return model summary
    data: a DataFrame containing both features and target
    features: identical to 'data', but with the target dropped"""
    
    features_sum = '+'.join(features.columns)
    formula = 'price' + '~' + features_sum

    model = ols(formula=formula, data=data).fit()
    return model.summary()
# Fit the model and return summary
model_it(midrange_homes, mh_features)

OLS Regression Results*

R-squared: 0.675

Adj. R-squared: 0.675

(See project notebook for full printout of model results.)

The first attempt at this model yields some interesting results. An adjusted R-squared value of 0.675 means that the model explains 67.5% of the variability in price, which is close to my goal of 70%. Near the bottom of the model summary, the skew and kurtosis are not too far from the values (skew = 0, kurtosis = 3) that would indicate normally distributed residuals. The Jarque-Bera score of 866 is very high, however, which suggests that the residuals are not normally distributed, and that is not a good thing for linear regression.

Let's see if the model can be improved by dropping the two features with p-values over my chosen threshold of 0.05: long and sqft_lot15. Those high p-values indicate that the effects of those features are not significant, and reducing the overall number of features could improve the adjusted R-squared value.
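As a side note, the high-p-value features can also be identified programmatically. This is only a sketch: model_it as written returns the summary rather than the fitted model, so here the first model is refit directly.

# Sketch: refit the first model and list features with p-values above 0.05
from statsmodels.formula.api import ols

first_model = ols(formula='price~' + '+'.join(mh_features.columns),
                  data=midrange_homes).fit()
pvals = first_model.pvalues.drop('Intercept')
pvals[pvals > 0.05]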

Second model

I'll run the model again, this time minus the two features whose p-values were above my significance threshold of 0.05 (long and sqft_lot15).

# Drop unwanted features and rerun the modeling function
mh_features_fewer = mh_features.drop(['long', 'sqft_lot15'], axis=1)

model_it(midrange_homes, mh_features_fewer)

OLS Regression Results*

R-squared: 0.675

Adj. R-squared: 0.675

(See project notebook for full printout of model results.)

Dropping those features did not change the adjusted R-squared at all, but now all of the remaining coefficients are statistically significant at the 0.05 level, so we can be more confident when interpreting their effects on price.

From the coefficients, we can get a few important insights:

  • For every additional square foot of living space, we can expect the price of a home to go up by $62.99.
  • A step up in condition could raise the price by over $23,900.
  • A step up in grade could raise the price by over $65,300.

Third model

This time I will try to fit a model to my alternate DataFrame, mh_zips_encoded, which uses one-hot encoding on the original zipcode feature.

# Split the data into features and target
mh_zips_encoded_features = mh_zips_encoded.drop(['price'], axis=1)
mh_zips_encoded_target = mh_zips_encoded['price']
model_it(mh_zips_encoded, mh_zips_encoded_features)

OLS Regression Results

R-squared: 0.821

Adj. R-squared: 0.821

(See project notebook for full printout of model results.)

That's a huge improvement in model performance, with an adjusted R-squared value of 0.821 compared to 0.675 without one-hot encoding of zipcodes. Many of the features, however, do not have statistically significant effects (their p-values are over my chosen threshold of 0.05). Now it's time to see if I can eliminate some of the less significant features while maintaining or improving model performance.

Stepwise feature selection

I will use stepwise feature selection to try to narrow my list of features to only those that are really affecting price. I got the code for stepwise_selection below from the Flatiron School data science curriculum.

# Paste in code for 'stepwise_selection' function
def stepwise_selection(X, y, 
                       initial_list=[], 
                       threshold_in=0.01, 
                       threshold_out = 0.05, 
                       verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Parameters:
        X: pandas.DataFrame with candidate features
        y: list-like with the target
        initial_list: list of features to start with (column names of X)
        threshold_in: include a feature if its p-value < threshold_in
        threshold_out: exclude a feature if its p-value > threshold_out
        verbose: whether to print the sequence of inclusions and exclusions
    Returns: list of selected features 
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max() # null if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included
result = stepwise_selection(mh_zips_encoded_features, mh_zips_encoded_target, verbose = True)
print('resulting features:')
print(result)

resulting features:

['bathrooms', 'sqft_living15', 'grade', 'yr_built', 'lat', 'sqft_living', 'zip_98155', 'zip_98028', 'zip_98004', 'view', 'zip_98019', 'zip_98040', 'zip_98011', 'condition', 'zip_98022', 'zip_98014', 'sqft_lot', 'zip_98178', 'zip_98112', 'zip_98116', 'zip_98119', 'zip_98058', 'zip_98006', 'zip_98136', 'waterfront', 'zip_98005', 'zip_98199', 'zip_98109', 'zip_98010', 'zip_98122', 'zip_98029', 'zip_98055', 'month', 'zip_98102', 'zip_98105', 'zip_98027', 'zip_98075', 'zip_98103', 'zip_98031', 'zip_98107', 'zip_98188', 'zip_98039', 'yr_renovated', 'zip_98115', 'zip_98117', 'zip_98033', 'zip_98126', 'zip_98144', 'zip_98008', 'zip_98070', 'long', 'zip_98007', 'zip_98052', 'zip_98074', 'zip_98118', 'zip_98053', 'zip_98166', 'zip_98042', 'zip_98032', 'zip_98146', 'zip_98030', 'zip_98106', 'zip_98059', 'zip_98092', 'zip_98023', 'zip_98177', 'zip_98125', 'zip_98056', 'zip_98034', 'zip_98108', 'zip_98065', 'zip_98024']

(See project notebook for full printout.)

Now I will rerun the model using just the features selected by stepwise_selection.

Note that I am using the list of features selected when I applied stepwise selection the first time; rerunning this notebook may produce slightly different results in the list above than what you see in the cell below.
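(As an alternative sketch, not used below, the reduced feature DataFrame could be built directly from the result list returned above, which keeps a rerun self-consistent with whatever features were just selected.)

# Alternative (not used below): build the reduced feature set straight from `result`
mh_selected_alt = mh_zips_encoded[result]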

# Make a new DataFrame using just the selected features
stepwise_selections = ['grade', 'yr_built', 'lat', 'sqft_living15', 
                       'sqft_living', 'zip_98155', 'zip_98028', 'zip_98004', 
                       'view', 'zip_98019', 'zip_98040', 'zip_98011', 
                       'condition', 'zip_98022', 'zip_98014', 'sqft_lot', 
                       'zip_98178', 'zip_98112', 'zip_98116', 'zip_98119', 
                       'zip_98058', 'zip_98136', 'zip_98006', 'waterfront', 
                       'zip_98109', 'zip_98005', 'zip_98199', 'zip_98122', 
                       'zip_98010', 'zip_98029', 'zip_98102', 'month', 
                       'zip_98055', 'zip_98105', 'zip_98027', 'zip_98103', 
                       'zip_98107', 'zip_98075', 'zip_98115', 'zip_98117', 
                       'zip_98033', 'zip_98126', 'zip_98144', 'zip_98008', 
                       'yr_renovated', 'zip_98039', 'long', 'zip_98070', 
                       'zip_98007', 'zip_98052', 'zip_98118', 'zip_98074', 
                       'zip_98053', 'zip_98166', 'zip_98031', 'bathrooms', 
                       'zip_98042', 'zip_98146', 'zip_98106', 'zip_98032', 
                       'zip_98030', 'zip_98188', 'zip_98059', 'zip_98092', 
                       'zip_98023', 'zip_98177', 'zip_98125', 'zip_98056', 
                       'zip_98034', 'zip_98108', 'zip_98065', 'zip_98024' ]

cols_to_drop = [col for col in mh_zips_encoded.columns 
                if col not in stepwise_selections]

mh_selected = mh_zips_encoded.drop(cols_to_drop, axis=1)
# Preview the new DataFrame
mh_selected.head()
bathrooms sqft_living sqft_lot waterfront view condition grade yr_built yr_renovated lat ... zip_98126 zip_98136 zip_98144 zip_98146 zip_98155 zip_98166 zip_98177 zip_98178 zip_98188 zip_98199
0 1.00 1180 5650 0 0 3 7 1955 0 47.5112 ... 0 0 0 0 0 0 0 1 0 0
1 2.25 2570 7242 0 0 3 7 1951 1991 47.7210 ... 0 0 0 0 0 0 0 0 0 0
2 1.00 770 10000 0 0 3 6 1933 0 47.7379 ... 0 0 0 0 0 0 0 0 0 0
3 3.00 1960 5000 0 0 5 7 1965 0 47.5208 ... 0 1 0 0 0 0 0 0 0 0
4 2.00 1680 8080 0 0 3 8 1987 0 47.6168 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 72 columns

Final model

Finally, I will try to fit the multiple linear regression model again, this time using my new DataFrame, which contains only the features chosen by stepwise selection.

# Prepare the inputs and call model_it
model_it(mh_zips_encoded, mh_selected)

OLS Regression Results

R-squared: 0.821

Adj. R-squared: 0.820

(See project notebook for full printout of model results.)

The model summary above shows that stepwise selection reduced the adjusted R-squared by only 0.001 while eliminating the features with p-values over 0.05. Not bad! This summary can give us some important insights about home pricing.

According to this model:

  • Adding a bathroom or increasing square footage by 100 square feet could raise the price of a home by over $8,300.

  • Improving the condition of a home by one level (e.g., from 3 to 4) could raise the price of the home by over $21,000.

  • Improving the grade of a home by one level (e.g., from 7 to 8) could raise the price of the home by over $42,000.
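
To show where these dollar figures come from, here is a minimal sketch of how they can be read off a fitted statsmodels OLS results object. The object name results is an assumption (model_it fits and summarizes the model internally, and its return value isn't shown here); the feature names match those selected above.

# Hypothetical sketch: translating OLS coefficients into dollar estimates.
# `results` is assumed to be a fitted statsmodels OLS results object for the
# final model; since price is untransformed, each coefficient is the estimated
# change in sale price (in dollars) for a one-unit change in that feature,
# holding the other features constant.
coefs = results.params
print('One extra bathroom:        ${:,.0f}'.format(coefs['bathrooms']))
print('100 more sq ft of living:  ${:,.0f}'.format(coefs['sqft_living'] * 100))
print('Condition up one level:    ${:,.0f}'.format(coefs['condition']))
print('Grade up one level:        ${:,.0f}'.format(coefs['grade']))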

Finally, some interpretation of the many zipcode features:

Many zipcodes had a significant positive impact on price, and they come from both the Seattle metro area (zipcodes above 98100) and elsewhere in King County. In other words, there are places throughout the county where location adds to the value of a home, probably because location can add value in different ways, especially in conjunction with other features. In the Seattle metro area, location may add value because many people want to live in the city or within commuting distance of jobs in the city center. Outside the metro area, a home can have a bigger lot, more square footage, and so on, so location adds value through its relationship to other features.

Homeowners can't do much to change the location of the home they own, but an awareness of how location is affecting their home's value could help them decide what sorts of improvements to make. Location awareness could also help prospective buyers understand the tradeoffs of buying a home of a given size or quality in different locations.

I think that this final model delivers the best and most helpful results for my middle-income family audience, so I will proceed with analyzing this one using Q-Q plots.

Testing for linear model assumptions

Now that I have settled on my best model, I want to make sure that the three main assumptions for linear models hold true:

  • Each feature has a linear relationship with the target.

  • The residuals are normally distributed.

  • The residuals are homoscedastic (i.e., there is an equal chance of having small or large residuals along the range of each feature).

I already decided that most of my features had linear relationships to the target when I inspected the scatterplots above. Now I'll use Q-Q plots to check the residuals for normality and homoscedasticity.

# Import needed packages and prep input for Q-Q plots
# (sm and ols were already imported earlier in the notebook)
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

columns_for_qq = mh_zips_encoded_features.columns

# For each feature, fit a simple regression of price on that feature alone,
# then check its residuals with a Q-Q plot and regression diagnostics
for column in columns_for_qq:
    formula = 'price' + '~' + column
    model = ols(formula=formula, data=mh_zips_encoded).fit()
    # Q-Q plot: residuals should hug the 45-degree line if they are normal
    sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)
    plt.title('QQ plot for ' + str(column))
    plt.show()
    # Diagnostic plots (fitted vs. residuals, partial regression, etc.)
    fig = plt.figure(figsize=(16,14))
    fig = sm.graphics.plot_regress_exog(model, column, fig=fig)

(See the project notebook in this repository for a full printout of the Q-Q plots.)

My residuals definitely aren't perfectly normal, but they aren't too bad for some of the features, like bedrooms, bathrooms, sqft_living, and grade. Although there is plenty of room for improvement in my model (and lots that it can't explain, simply because the original data are limited), I think it is worth proceeding to the validation stage with this model.

Validating the model

To check whether my model can generalize to houses outside my dataset, I will validate it using k-fold cross-validation. Because I am going to use a function from Scikit-Learn for validation, I will first duplicate my model in Scikit-Learn to make things a little easier.

# Import needed sklearn packages
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Initialize and run the regression model using sklearn
linreg = LinearRegression()
linreg.fit(mh_selected, mh_zips_encoded['price'])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
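# (Sketch added for illustration, not in the original notebook:) a quick check
# that the sklearn duplicate matches the statsmodels fit -- its R-squared on
# the training data should be close to the 0.821 reported above
print(linreg.score(mh_selected, mh_zips_encoded['price']))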
# Define a function to call cross_val_score for various k values
def crossval_it(model, features, target, k):
    
    """Find mean of MSE for k folds in cross-validation
    model: your linear model, preferably a LinearRegression() object
    features: a DataFrame of just your features
    target: a DataFrame of just your target
    k: number of folds for cross-validation
    Dependencies: numpy, sklearn.metrics.mean_squared_error, 
    sklearn.metrics.cross_val_score"""
    
    result = np.mean(cross_val_score(model, features, target, cv=k, 
                                      scoring='neg_mean_squared_error'))
    return result
# Define a function to compare k-fold cross-validation results at various
# values of k
def compare_k_folds(model, features, target, folds_list):
    
    """Print MSE for k-fold cross-validation with various k values
    model: your linear model, preferably a LinearRegression() object
    features: a DataFrame of just your features
    target: a DataFrame of just your target
    folds_list: a list of numbers of folds (k) to use in cross-
    validation
    Dependencies: numpy, sklearn.model_selection.cross_val_score (via crossval_it)"""
    
    for fold in folds_list:
        mean_mse = crossval_it(model, features, target, fold)
        
        print('{} folds: {}'.format(fold, mean_mse))
# Call compare_k_folds for 5, 10, and 20 folds
folds = [5, 10, 20]

compare_k_folds(linreg, mh_selected, mh_zips_encoded['price'], folds)
5 folds: -5729117966.714569

10 folds: -5722862137.233953

20 folds: -5698285685.955608

The results above for 5-, 10-, and 20-fold cross-validation are (negative) mean squared errors. Taking the square root of the absolute value of each (the root mean squared error, or RMSE), I can see that, while my model doesn't predict price perfectly, it is at least quite consistent.

# Define a function to compare RMSE for k-fold cross-validation at
# various k values
def compare_rmse(model, features, target, folds_list):
    
    """Compare RMSE for k-fold cross-validation results at various k values
    model: your linear model, preferably a LinearRegression() object
    features: a DataFrame of just your features
    target: a DataFrame of just your target
    folds_list: a list of numbers of folds (k) to use in cross-
    validation
    Dependencies: numpy, sklearn.model_selection.cross_val_score (via crossval_it)"""
    
    for fold in folds_list:
        rmse = np.sqrt(abs(crossval_it(model, features, target, fold)))
        print('{} folds: {}'.format(fold, rmse))
# Call compare_rmse for 5, 10, and 20 folds
compare_rmse(linreg, mh_selected, mh_zips_encoded['price'], folds)
5 folds: 75690.9371504579

10 folds: 75649.60103816776

20 folds: 75486.99017682191

The benefit of looking at RMSE is that it is in the same units as the target; since the target is price and it hasn't been transformed in any way, RMSE can be expressed in dollars. As we cross-validate with more and more folds, RMSE stays quite consistent, varying by only about $204 from 5 to 20 folds. From this validation process I conclude that, although my model doesn't explain all the variation in price, it does its job consistently; its successes are not just random or lucky.
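
As a quick check on the conversion, here is the arithmetic for the 5-fold and 20-fold scores (cross_val_score returns negative MSE, so I take the absolute value before the square root):

# Convert the negative MSE scores above into RMSE (dollars)
print(np.sqrt(abs(-5729117966.714569)))   # 5 folds  -> about 75,690.94
print(np.sqrt(abs(-5698285685.955608)))   # 20 folds -> about 75,486.99
# Difference across folds is only about $204
print(np.sqrt(abs(-5729117966.714569)) - np.sqrt(abs(-5698285685.955608)))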

In the "Interpretation" section below, I will wrap up with a summary of how my model can help middle-income families understand the factors that influence the price of a home they want to buy or sell.

Interpretation

My purpose in analyzing the King County Housing dataset was to extract insights that would help middle-income families understand the factors influencing the prices of homes they want to buy or sell.

From the original dataset I eliminated the top 10% most expensive homes and then selected only homes with two to five bedrooms. This filtering left 18,999 observations for my analysis, with a median home price of $425,000, an affordable price for a family earning King County's median household income.
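
For reference, here is a minimal sketch of the kind of filtering described above. It assumes the raw DataFrame data with its price and bedrooms columns; the actual cleaning steps appear earlier in the notebook, and the exact order of operations there may differ, so the counts from this sketch may not match exactly.

# Hypothetical sketch of the midrange filter (the real cleaning code lives
# earlier in the notebook and handles more than this)
price_cap = data['price'].quantile(0.90)            # drop the top 10% most expensive homes
midrange = data[(data['price'] <= price_cap) &
                (data['bedrooms'].between(2, 5))]   # keep homes with 2 to 5 bedrooms
print(len(midrange), midrange['price'].median())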

Examining a few features in isolation, I was able to observe the following:

  • The time of year may have an impact on sale price. Median sale prices in the months of April, May, June, and July tend to be higher than those at other times of year. For instance, April's median home price is $41,000 higher than January's.

  • The number of bedrooms may be a good indicator of sale price, and adding a bedroom could cause a substantial increase in the sale price of a home. For instance, adding a fourth bedroom to a three-bedroom house could cause up to a 22% increase in sale price.

  • Similarly, improving the grade of a house could have a substantial impact on price. For instance, raising a house's grade from 7 to 8 could raise the sale price by over $126,000.
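
As an illustration of how such single-feature comparisons can be made, here is a minimal sketch using grouped medians. I'm using mh_zips_encoded, which contains price along with month and grade; the original exploration earlier in the notebook may have used a different intermediate DataFrame.

# Sketch: median sale price by sale month and by grade, each considered in isolation
monthly_medians = mh_zips_encoded.groupby('month')['price'].median()
grade_medians = mh_zips_encoded.groupby('grade')['price'].median()
print(monthly_medians)
print(grade_medians)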

It is important to note that the above observations are based on considering features in isolation, but in reality, the features in this dataset are probably related to one another in ways that affect the target (price). For instance, adding a bedroom probably means increasing a home's square footage at the same time, and if a new addition improves the design of the home, the grade may also increase.

My multiple linear regression model can offer more precise insights on the impacts of various features on price. I fit the model to a version of my dataset that used one-hot encoding on zipcodes, and I used stepwise feature selection to eliminate features that were not having a significant impact on price. The resulting model explains 82% of the variability in price.
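
For reference, the one-hot encoding step mentioned above can be sketched like this. The pre-encoding DataFrame name mh and the column name zipcode are assumptions (the actual encoding code appears earlier in the notebook); the prefix zip matches the zip_98XXX column names used in the model.

# Hypothetical sketch of one-hot encoding zipcodes (names are assumptions;
# whether a baseline column is dropped may differ from the notebook)
zip_dummies = pd.get_dummies(mh['zipcode'], prefix='zip', drop_first=True)
mh_zips_encoded = pd.concat([mh.drop('zipcode', axis=1), zip_dummies], axis=1)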

Specifically, the model coefficients show that:

  • Adding a bathroom or increasing square footage by 100 square feet could raise the price of a home by over $8,300.

  • Improving the condition of a home by one level (e.g., from 3 to 4) could raise the price of the home by over $21,000.

  • Improving the grade of a home by one level (e.g., from 7 to 8) could raise the price of the home by over $42,000.

The dollar values cited here are likely to be much more accurate than those above because they take into account the effects of many features on price, not just one at a time.

There are many factors influencing the price of a home, not all of which are within a homeowner's control. There are also many potential factors affecting price that the original dataset did not consider, such as a home's school district, or the walkability of the neighborhood, or the distance to major employment centers. My model indicates, however, that there are some concrete steps a homeowner can take to raise the potential sale price of a home, including improvements to amenities and design.

Appendix

Project repository

This repository (https://github.com/jrkreiger/pricing-midrange-homes) contains files from a project assigned to me as part of Flatiron School's Online Data Science Bootcamp. The purpose of the project was to demonstrate skills in data cleaning and multiple linear regression modeling.

Here's what you'll find inside:

  • pricing-midrange-homes-king-co-wa.ipynb: Jupyter Notebook containing project code

  • column_names.md: Markdown file containing the data dictionary for the dataset provided by Flatiron School

  • kc_house_data.csv: CSV file containing the dataset provided by Flatiron School

  • presentation.pdf: PDF file containing slides from a non-technical presentation of project results

Technical specifications

I created the Jupyter Notebook in a virtual environment provided to me by Flatiron School. Things may work a little differently for you if you're running different versions of Python or the packages I used.

If you download the notebook to view locally, note that I formatted it using Jupyter Notebook extensions (https://github.com/ipython-contrib/jupyter_contrib_nbextensions). The extensions let me easily create a table of contents and collapse groups of cells to make the notebook easier to read and navigate. If you don't have the extensions installed locally, the notebook will look a little different for you.

Related content

You can read my blog post about the project at this link: http://jrkreiger.net/uncategorized/cleaning-house-data/.

View a video of my non-technical presentation here: https://youtu.be/yvaGRdNtPNI.

Authorship

Flatiron School provided me the dataset, the data dictionary, a Jupyter Notebook containing only a rubric for recording my identifying information, and the code I used for stepwise selection. I wrote the rest of the code in the Jupyter Notebook, the text of this README, my blog post linked above, and the content of my non-technical presentation (with the exception of some images whose creators I credit in the slides).