
Create a platform that will predict a house price based on a user-input zip code and house type


HOME PRICE PREDICTIONS IN CALIFORNIA---Final Project


The PsyCat Learners

MEMBERS: James Milne | Katherine Matovic | Ellis Mok


California Home Price Prediction by zip code and house type


MAIN OBJECTIVES

  • To create a platform for avid real estate investors and potential homeowners looking to purchase their California dream home. By selecting a county and house type of interest, the user receives an estimated house price for every zip code in that county.
  • To use a database management system and a web development framework to deploy the data visualizations and the machine learning model on the web so that the platform is available to users.
  • To create a dashboard with user-driven interaction. The main feature of this project is that users select a county name and house type (e.g. 1 bedroom, condo) and the adjacent map returns the home price prediction for every zip code in that county, as sketched below.
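
A minimal sketch of how that interaction could be wired up in the Flask app is shown below; the route name, query parameters, and the predict_for_county helper are illustrative assumptions, not the project's actual code.

```python
# Hypothetical Flask endpoint for the county / house-type selection.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_for_county(county, house_type):
    """Placeholder for the trained model: returns {zip_code: predicted_price}
    for every zip code in the selected county and house type."""
    return {"92101": 815000, "92103": 1040000}  # dummy values for illustration

@app.route("/predict")
def predict():
    county = request.args.get("county", "San Diego")
    house_type = request.args.get("house_type", "condo")
    # The front-end map (e.g. Leaflet) colors each zip code by the returned price.
    return jsonify(predict_for_county(county, house_type))

if __name__ == "__main__":
    app.run(debug=True)
```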

DATA SOURCES

  • Zillow Research home value datasets (smoothed, seasonally adjusted monthly home price indices by zip code); see the Data Description section below.

DATA VISUALIZATIONS

  • CA map with tooltip
  • Histograms, which group data into bins and provide the quickest way to get a sense of the distribution of each attribute in a dataset. They show the center, spread, and skewness of the data and can reveal outliers and data frequencies.
  • Bar charts that show the average house price by county (a plotting sketch for both charts follows this list)
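
The histogram and bar chart could be reproduced in a notebook roughly as follows; the file name and the price/county column names are assumptions about the cleaned dataset, not the project's exact schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned dataset with one row per zip code / month observation.
df = pd.read_csv("ca_home_prices.csv")

# Histogram: distribution, spread, skewness, and outliers of home prices.
df["price"].plot.hist(bins=50, title="Distribution of CA home prices")
plt.xlabel("Price ($)")
plt.show()

# Bar chart: average house price by county.
(df.groupby("county")["price"].mean()
   .sort_values(ascending=False)
   .plot.bar(title="Average home price by county", figsize=(12, 4)))
plt.ylabel("Average price ($)")
plt.tight_layout()
plt.show()
```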

METHODOLOGY FOR THE MACHINE LEARNING ALGORITHM

Data Description

  • The real estate website Zillow publishes a variety of free, smoothed, seasonally adjusted datasets of monthly home prices, gathered from Multiple Listing Services (MLSs) and presented as an index by zip code for every state in the US, stretching back to 1996.
  • We selected all California zip codes (1,746 zips across 58 counties) with a monthly look-back of three years (2018-2021) for 1-bedroom through 5-bedroom homes, single-family homes, and condos; a loading and filtering sketch follows this list.
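
The sketch below assumes the Zillow download is a wide CSV with a State column, a few identifier columns, and one column per month; the exact file and column names may differ from what the project used.

```python
import pandas as pd

zhvi = pd.read_csv("zillow_home_values_by_zip.csv")  # assumed file name

# Keep only California zip codes.
ca = zhvi[zhvi["State"] == "CA"].copy()

# Keep identifiers plus the monthly columns inside the 2018-2021 look-back window.
id_cols = ["RegionName", "CountyName", "State"]
month_cols = [c for c in ca.columns if c[:4].isdigit() and "2018-01" <= c[:7] <= "2021-12"]
ca = ca[id_cols + month_cols]

print(ca["RegionName"].nunique(), "zip codes across",
      ca["CountyName"].nunique(), "counties")
```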

Data Transformations

  • Many California zip codes, in locales such as Orange County, the Bay Area, and San Diego, had average prices greater than $750,000. We scaled the data by dividing each observation by $1,000,000 to improve the model's speed and remove any impact of higher- versus lower-priced homes. Once the model had trained, we reversed the scaling by multiplying the observations by $1,000,000 (see the sketch after this list).
  • Other transformations (e.g. StandardScaler or MinMaxScaler) were tried but had no impact on model speed.
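
A minimal sketch of the $1,000,000 scaling, using a toy table in place of the real zip-code price data:

```python
import pandas as pd

SCALE = 1_000_000  # divide by $1M before training, multiply back afterwards

# Toy stand-in for the cleaned zip-code price table.
prices = pd.DataFrame({"2020-01": [750_000.0, 1_200_000.0],
                       "2020-02": [760_000.0, 1_210_000.0]},
                      index=["92101", "94105"])

scaled = prices / SCALE          # values the model trains on (0.75, 1.2, ...)
# ... fit the model on `scaled` ...
dollar_values = scaled * SCALE   # reverse the scaling after prediction
```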

Assumptions

  • We note that machine learning models do not perform well with blank data points. These can occur because some MLSs began reporting pricing data at different times. Examining the data, we found that the majority of MLSs had reported results for roughly the three years prior to the current period, and that became the basis for our selection window.
  • To remove any remaining blanks, we assigned each blank cell the value reported for the same zip code 12 months later (a fill sketch follows this list).
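
A sketch of that fill rule, assuming a wide table with one row per zip code and one column per consecutive month, so that shifting twelve columns to the left looks up the same zip's value 12 months later:

```python
import numpy as np
import pandas as pd

# Toy wide table: rows are zip codes, columns are consecutive months (scaled prices).
months = pd.date_range("2018-01-31", periods=36, freq="M").strftime("%Y-%m")
prices = pd.DataFrame(np.random.uniform(0.3, 1.5, size=(3, 36)),
                      index=["90210", "94103", "92037"], columns=months)
prices.iloc[0, :6] = np.nan  # simulate an MLS that started reporting late

# Fill each blank with the same zip code's value 12 months later.
filled = prices.fillna(prices.shift(-12, axis=1))
```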

Machine Learning Model Methodology

  • We divided the data into three parts: a training set of approximately 1,200 rows, a validation set of approximately 200 rows, and the remainder as the test set. Our first model was a simple linear regression, which was found to have a Mean Squared Error (MSE) of about 0.77.
  • The random seed was set to 42 to make our model repeatable. Using a Long Short-Term Memory (LSTM) model, we added 2 hidden layers with 128 neurons each, initially trained for 20 epochs, and used the default learning rate of 0.001. We experimented with other layer combinations and with epoch counts ranging from 5 to 20, but settled on 6 epochs to reduce the risk of over-fitting (a model sketch follows this list).
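
A minimal sketch of that configuration, assuming a Keras/TensorFlow implementation; the sample counts, window length, and single price feature below are placeholders rather than the project's actual training pipeline:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

tf.random.set_seed(42)  # make the model repeatable
np.random.seed(42)

# Placeholder data shaped (samples, timesteps, features) of scaled prices.
X_train = np.random.rand(1200, 24, 1).astype("float32")
y_train = np.random.rand(1200, 1).astype("float32")
X_val = np.random.rand(200, 24, 1).astype("float32")
y_val = np.random.rand(200, 1).astype("float32")

model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(24, 1)),  # hidden layer 1
    LSTM(128),                                               # hidden layer 2
    Dense(1),                                                # predicted (scaled) price
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=6, batch_size=32)
```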

Results and final thoughts

  • Overall, our model underestimated the values in each of the California counties by an average of about 7% YoY. This was in contrast to the overall upward trend of the pricing index, which would have projected growth of anywhere from 5-10% year over year. The overall MSE on our testing datasets averaged about 0.08 in scaled units (roughly $80,000 once the scaling was reversed). The loss averaged 0.10 for the validation set but around 0.33 for the training set (in scaled units), indicating our model fit fairly well but could still be improved.
  • Our model's shortcomings likely stem from the difficulty of projecting a large number of varying time series, one per zip code, rather than a large number of observations from a single time series (as would be the case for a financial time series). With more data points in the future, or additional features, we may be able to resolve these issues.

LIBRARIES & TOOLS

  • Python/Pandas
  • Colaboratory
  • Flask App
  • SQL
  • Tableau
  • JavaScript
  • Leaflet
  • HTML
  • CSS
  • Bootstrap
  • Jupyter