For my final project, I analyzed Zillow home sales data for Mound, MN over the past two years, specially focusing on 3 bedroom, 2+ bathroom houses.
I have a rental house in Mound, MN that I’m planning to put on the market in June and I’m interested to see if the past home sales data will reflect any interesting trends or help me predict the final sale price of my house.
I started by manually curating a data frame (‘moundsales’) from Zillow (Data Source).
The data frame includes 150 instances and 12 variables: ‘ID’, ‘Address’, ‘Beds’, ‘Baths’, ‘SQFT’, ‘Month.Sold’, ‘Year.Sold’, ‘Sale.Price’, ‘Lake.Front’, ‘Year.Built’, ‘Lot.Size.Sqft’, ‘Most.Recent.Tax.Assessment’.
I conducted exploratory data analysis of the data frame, created new data subsets, summarized data via histogram, scatter and box plots, determined probability, determined mean/standard deviation/z-score, and conducted regression.
According to Zillow, the were 150 three bedroom, two-plus bathroom houses sold in Mound, MN between April 2020 and April 2022. Mound is a community on Lake Minnetonka, therefore 32 of the 150 houses are lake front properties. This is an important distinction because these houses sold for significantly more than non-lake front houses, usually in the million dollar range.
My house was built in 1948, has 1859 finished sqft and includes three bedrooms and two baths. The house sits on a 10,000 sqft, non-lake front, lot and was assessed at $245,000 in 2021.
I’d like to understand the relationship between sale price and the most recent tax assessment data and determine the probability of my house selling for over $300,000.
For this observational study, I created a data frame labeled ‘moundsales’. In addition, I created a subset labeled ‘nonlf’ that only includes non-lake front houses because including lake front houses skews the sale price and tax assessment data.
moundsales <- read.table("moundsales.csv", sep = ",", header = TRUE)
nonlf <- subset(moundsales, moundsales$Lake.Front == 'N')
I conducted a number of various data exploration activities in R (e.g., dim, length, names, summary, head, tail, etc.) to anlayze both the ‘moundsales’ and ‘nonlf’ data frames. See Appendix for more details.
The mean sale price for lake front houses was $888,192, while the mean sale price for non-lake front houses is $324,801.
lakefront <- subset(moundsales, moundsales$Lake.Front == 'Y')
summary(lakefront)
## ID Address
## Min. : 2.00 1579 Bluebird Ln : 1
## 1st Qu.: 38.50 1721 Eagle Ln : 1
## Median : 88.00 1778 Shorewood Ln: 1
## Mean : 78.38 1791 Resthaven Ln: 1
## 3rd Qu.:119.75 1969 Lakeside Ln : 1
## Max. :150.00 2033 Arbor Ln : 1
## (Other) :26
## Beds Baths SQFT
## Min. :3 Min. :2.000 Min. :1302
## 1st Qu.:3 1st Qu.:2.000 1st Qu.:1718
## Median :3 Median :3.000 Median :2239
## Mean :3 Mean :3.078 Mean :2316
## 3rd Qu.:3 3rd Qu.:4.000 3rd Qu.:2552
## Max. :3 Max. :5.000 Max. :4153
##
## Month.Sold Year.Sold Sale.Price
## Jul :8 Min. :2020 Min. : 445000
## Dec :5 1st Qu.:2020 1st Qu.: 700000
## Apr :4 Median :2020 Median : 850000
## Aug :4 Mean :2021 Mean : 888192
## Mar :3 3rd Qu.:2021 3rd Qu.:1035000
## Nov :2 Max. :2022 Max. :1670000
## (Other):6
## Lake.Front Year.Built Lot.Size.Sqft
## N: 0 Min. :1910 Min. : 4356
## Y:32 1st Qu.:1946 1st Qu.: 7840
## Median :1972 Median : 9583
## Mean :1968 Mean :10911
## 3rd Qu.:1990 3rd Qu.:12306
## Max. :2020 Max. :28750
##
## Most.Recent.Tax.Assessment
## Min. : 411000
## 1st Qu.: 637750
## Median : 705000
## Mean : 749812
## 3rd Qu.: 855750
## Max. :1344000
##
boxplot(moundsales$Sale.Price ~ moundsales$Lake.Front)
Only 46 houses, or 39%, sold for under $300,000, which supports my realator’s claim that there has been a shortage of houses under $300,000 for sale in Mound, MN.
under300k <- subset(nonlf, nonlf$Sale.Price < 300000)
dim(under300k)
## [1] 46 12
Of the 46 houses that sold for more than $300,000, 16 had a recent tax assessment of $245,000 or below.
S300kT245 <- subset(nonlf, nonlf$Sale.Price > 300000 & nonlf$Most.Recent.Tax.Assessment < 246000)
dim(S300kT245)
## [1] 16 12
The probability of a three bedroom, two-plus bathroom house in Mound, MN selling for $300,000 or more given it is not lake front is 61%.
Over299k <- subset(nonlf, nonlf$Sale.Price > 299999)
72/118
## [1] 0.6101695
The probability of a three bedroom, two-plus bathroom house in Mound, MN selling for $1 million or more given it has lake front is 28%.
table(moundsales$Lake.Front)
##
## N Y
## 118 32
milsale <- subset(moundsales, moundsales$Sale.Price > 999999)
View(milsale)
9/32
## [1] 0.28125
The sale price distribution for the ‘nonlf’ data set is unimodal and normal with a slight skew to the right. The distribution has the following statistics:
mean = 324801 sd = 74870 x = 300000 Z-score = - .331
There is a 63% probability that my house will sell for over $300k.
format(nonlf$Sale.Price, scientific = FALSE)
## [1] "341000" "346000" "330000" "315000" "322000"
## [6] "318000" "360000" "427900" "315000" "535000"
## [11] "299251" "400000" "273000" "280000" "350000"
## [16] "118400" "290000" "302000" "335000" "302000"
## [21] "490000" "300000" "386000" "310000" "250000"
## [26] "430000" "370000" "387500" "365000" "295000"
## [31] "385000" "320000" "292000" "452061" "425000"
## [36] "395000" "287000" "582500" "425000" "330000"
## [41] "232500" "375000" "310000" "300000" "480000"
## [46] "305000" "518500" "350000" "342500" "355000"
## [51] "333000" "354000" "320000" "420000" "396000"
## [56] "340000" "339900" "290000" "271900" "320000"
## [61] "312999" "215000" "281500" "355000" "370000"
## [66] "325000" "435000" "305000" "360000" "289500"
## [71] "449051" "280000" "325000" "240000" "235000"
## [76] "400000" "390000" "300000" "250000" "465000"
## [81] "237000" "279000" "177000" "265000" "271000"
## [86] "350000" "270000" "320678" "270000" "272500"
## [91] "275000" "268000" "339000" "200000" "262000"
## [96] "281685" "315000" "350000" "278900" "275000"
## [101] "435000" "225000" "319000" "236400" "335000"
## [106] "245000" "265000" "430000" "270000" "226600"
## [111] "295000" "240000" "275000" "289900" "228000"
## [116] "330000" "329900" "222000"
hist(nonlf$Sale.Price)
sd(nonlf$Sale.Price)
## [1] 74870.19
Given:
mean = 324801 sd = 74870 x = 300000
(300000 - 324801)/74870
## [1] -0.3312542
Answer: z-score = - .331
1 - pnorm(-.331)
## [1] 0.6296778
Answer: 63%
According to the regression data below, there is a very high correlation between the ‘Sale.Price’ and ‘Most.Recent.Tax.Assessment’ data. The intercept value indicates that one could expect a house to sell for a premium of $20,934 over the most recent tax assessment for the house.
plot(moundsales$Sale.Price ~ moundsales$Most.Recent.Tax.Assessment)
lm(moundsales$Sale.Price ~ moundsales$Most.Recent.Tax.Assessment)
##
## Call:
## lm(formula = moundsales$Sale.Price ~ moundsales$Most.Recent.Tax.Assessment)
##
## Coefficients:
## (Intercept)
## 20934.465
## moundsales$Most.Recent.Tax.Assessment
## 1.131
summary(lm(moundsales$Sale.Price ~ moundsales$Most.Recent.Tax.Assessment))
##
## Call:
## lm(formula = moundsales$Sale.Price ~ moundsales$Most.Recent.Tax.Assessment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -305751 -44941 -16431 30781 510236
##
## Coefficients:
## Estimate
## (Intercept) 2.093e+04
## moundsales$Most.Recent.Tax.Assessment 1.131e+00
## Std. Error
## (Intercept) 1.560e+04
## moundsales$Most.Recent.Tax.Assessment 3.580e-02
## t value
## (Intercept) 1.342
## moundsales$Most.Recent.Tax.Assessment 31.596
## Pr(>|t|)
## (Intercept) 0.182
## moundsales$Most.Recent.Tax.Assessment <2e-16
##
## (Intercept)
## moundsales$Most.Recent.Tax.Assessment ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97490 on 148 degrees of freedom
## Multiple R-squared: 0.8709, Adjusted R-squared: 0.87
## F-statistic: 998.3 on 1 and 148 DF, p-value: < 2.2e-16
In conclusion, I determined that there is a very high correlation between sale price and the most recent tax assessment for 3 bedroom, 2+ bath houses sold in Mound, MN in the past two years. During this time, 72 non-lake front houses were sold for $300,000 or more and there is a 63% probability that my house will sell for $300,000 or more. As a side note, I also learned that if my property were to have lake front, there would be a 28% probability of selling my property for $1 million or more!
moundsales <- read.table("moundsales.csv", sep = ",", header = TRUE)
dim(moundsales)
## [1] 150 12
length(dim(moundsales))
## [1] 2
names(moundsales)
## [1] "ID"
## [2] "Address"
## [3] "Beds"
## [4] "Baths"
## [5] "SQFT"
## [6] "Month.Sold"
## [7] "Year.Sold"
## [8] "Sale.Price"
## [9] "Lake.Front"
## [10] "Year.Built"
## [11] "Lot.Size.Sqft"
## [12] "Most.Recent.Tax.Assessment"
head(moundsales)
## ID Address Beds Baths SQFT
## 1 1 4674 Cumberland Rd 3 2.0 2369
## 2 2 6641 Halstead Ave 3 3.0 2092
## 3 3 4767 Richmond Rd 3 2.0 1500
## 4 4 4515 Manchester Rd 3 2.0 1931
## 5 5 6117 Beachwood Rd 3 2.5 2400
## 6 6 4812 Lanark Rd 3 2.0 2010
## Month.Sold Year.Sold Sale.Price Lake.Front
## 1 Apr 2022 341000 N
## 2 Apr 2022 980000 Y
## 3 Apr 2022 346000 N
## 4 Mar 2022 330000 N
## 5 Mar 2022 315000 N
## 6 Mar 2022 322000 N
## Year.Built Lot.Size.Sqft
## 1 1984 6534
## 2 1980 10018
## 3 1981 5662
## 4 1920 8276
## 5 1970 12632
## 6 1987 9583
## Most.Recent.Tax.Assessment
## 1 290000
## 2 631000
## 3 207000
## 4 240000
## 5 282000
## 6 256000
tail(moundsales)
## ID Address Beds Baths SQFT
## 145 145 5736 Lynwood Blvd 3 2 1939
## 146 146 4714 Hanover Rd 3 2 1630
## 147 147 4959 Leslie Rd 3 3 2729
## 148 148 5447 Breezy Rd 3 2 2047
## 149 149 2740 Grove Ln 3 2 1426
## 150 150 3201 Charles Ln 3 4 2475
## Month.Sold Year.Sold Sale.Price Lake.Front
## 145 Apr 2020 289900 N
## 146 Apr 2020 228000 N
## 147 Apr 2020 330000 N
## 148 Apr 2020 329900 N
## 149 Apr 2020 222000 N
## 150 Apr 2020 1100000 Y
## Year.Built Lot.Size.Sqft
## 145 1910 16553
## 146 1972 6534
## 147 1965 17860
## 148 1946 10018
## 149 1984 10055
## 150 1988 13068
## Most.Recent.Tax.Assessment
## 145 273000
## 146 234000
## 147 334000
## 148 312000
## 149 222000
## 150 1001000
summary(moundsales$Sale.Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 118400 287625 337000 444991 451308 1670000
summary(moundsales$Most.Recent.Tax.Assessment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 241250 278500 374900 399750 1344000
table(moundsales$Lake.Front)
##
## N Y
## 118 32
milsale <- subset(moundsales, moundsales$Sale.Price > 999999)
View(milsale)
lakefront <- table(moundsales$Lake.Front)
barplot(lakefront)
nonlf <- subset(moundsales, moundsales$Lake.Front == 'N')
summary(nonlf$Sale.Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 118400 275000 316500 324801 358750 582500
summary(nonlf$Most.Recent.Tax.Assessment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 237250 259000 273229 307750 464000
summary(nonlf$SQFT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1117 1542 1820 1876 2154 3566
summary(nonlf$Lot.Size.Sqft)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3484 6534 9583 10029 11761 23522
summary(nonlf$Year.Built)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1910 1954 1974 1969 1987 2020
under300k <- subset(nonlf, nonlf$Sale.Price < 300000)
dim(under300k)
## [1] 46 12