Project: Single-family residence price prediction.
Build a machine learning model that can predict single-family residence prices based on data from 2017.
- Clone this repo onto your computer.
- Acquire the data from the database using your `env.py` file. Put the data in the folder containing the cloned repo.
- Run the `zillow_project.ipynb` file.
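The `env.py` file referenced above is not included in the repo. A minimal sketch of what it might contain, assuming the conventional layout of credentials plus a URL helper (the variable names `host`, `user`, `password` and the function `get_db_url` are assumptions, not confirmed by this repo):

```python
# env.py -- database credentials (do NOT commit this file to version control).
# The variable and function names below are assumptions about the expected layout.

host = "your.database.host"
user = "your_username"
password = "your_password"

def get_db_url(db_name, host=host, user=user, password=password):
    """Build a SQLAlchemy-style MySQL connection URL for the given database."""
    return f"mysql+pymysql://{user}:{password}@{host}/{db_name}"
```

The notebook would then call something like `pd.read_sql(query, get_db_url("zillow"))` to pull the data.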
My initial hypothesis is that the main price predictors will be the number of bathrooms and bedrooms.
- Acquire the data from the `zillow` database. Transform the data into a Pandas DataFrame to make it easy to use and manipulate in the Jupyter Notebook.
- Prepare the data for exploration and analysis. Find out whether any values are missing and decide how to handle them.
- Change the data types if needed
- Determine whether new features can be created to simplify the exploration process.
- Handle the outliers.
- Create a data dictionary.
- Split the data into 3 data sets: train, validate and test data (56%, 24%, and 20% respectively)
- Explore the train data set through visualizations and statistical tests.
- Find which features have an impact on the house prices.
- Summarize the exploration and document the main takeaways.
- Impute the missing values if needed.
- Pick the features that can help to build a good prediction model.
- Identify if new features have to be created.
- Encode the categorical variables
- Split the target variable from the data sets.
- Scale the data prior to modeling.
- Pick the regression algorithms for creating the prediction model.
- Create the models and evaluate regressors using the RMSE score on the train data set.
- Pick the five best-performing models based on the RMSE score and evaluate them on the validation set.
- Find which model performs best: relatively high predictive power on the validation set and only a small gap between the train and validation results.
- Make predictions for the test data set.
- Evaluate the results.
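The split, scale, and evaluate steps above can be sketched as follows. This is a minimal illustration on synthetic data; the column names (`bedrooms`, `bathrooms`, `sqft`, `home_value`) are assumptions standing in for the real zillow columns:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the prepared zillow dataframe (column names are assumed).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, 500),
    "bathrooms": rng.integers(1, 4, 500),
    "sqft": rng.integers(500, 4000, 500),
})
df["home_value"] = df["sqft"] * 150 + df["bathrooms"] * 10_000 + rng.normal(0, 20_000, 500)

# 56% / 24% / 20% split: peel off 20% for test, then 30% of the rest for validate.
train_val, test = train_test_split(df, test_size=0.20, random_state=42)
train, validate = train_test_split(train_val, test_size=0.30, random_state=42)

# Separate the target, then scale features -- fit the scaler on train only.
X_train, y_train = train.drop(columns="home_value"), train["home_value"]
X_val, y_val = validate.drop(columns="home_value"), validate["home_value"]
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)

# Fit one candidate regressor and compare RMSE on train vs. validate.
model = GradientBoostingRegressor(random_state=42).fit(X_train_s, y_train)
rmse_train = mean_squared_error(y_train, model.predict(X_train_s)) ** 0.5
rmse_val = mean_squared_error(y_val, model.predict(X_val_s)) ** 0.5
```

Fitting the scaler on the train set only (and merely transforming validate and test) avoids leaking information from the evaluation sets into the model.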
Conclusions
- It was impossible to remove all outliers without decreasing the data size dramatically. Two columns, `lot_sqft` and `home_value`, still contain many of them. On top of this, `home_value` contains some unrealistic values (e.g., home prices below $50K). This might negatively affect the model's performance.
- The mean price is more than $80K higher than the median price.
- The most common house prices are between $50K and $100K.
- There is a significant difference in the house prices among counties. Houses in Orange county have the highest prices, while prices in Los Angeles are below the median.
- Houses with a pool are more expensive. Most of them have a price above the median.
- The most expensive houses without a pool are in Orange county; the most expensive houses with a pool are in Ventura county.
- There is a positive correlation between square footage and price.
- Ventura county has the strongest square footage / price relationship.
- There is no correlation between the house age and its price in LA county while other counties have a strong negative correlation.
- Gradient Boosting Regressor performed the best with the whole data set and with the Ventura county data.
- Gradient Boosting Regressor predicts well but doesn't return stable results: the RMSE scores vary a lot across all three sets.
- For stable results I would pick Random Forest Regressor or Lasso Lars Regressor.
- Overall, my regression model performs well. Its predictions beat the baseline model by 23.5%.
- The model would perform even better if the data from LA county contained a stronger relation between the features and price.
- To improve prediction results, I would recommend pulling more features from the database and looking for ones that have a strong correlation with price in LA county.
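The improvement-over-baseline figure above is the relative drop in RMSE compared with a baseline that always predicts the mean home value. With hypothetical RMSE numbers (the actual scores live in the notebook), it is computed as:

```python
# Hypothetical RMSE values for illustration; the real scores are in zillow_project.ipynb.
baseline_rmse = 200_000.0   # baseline: predict the mean home_value for every house
model_rmse = 153_000.0      # best regressor evaluated on the test set

# Percent improvement of the model over the baseline.
improvement = (baseline_rmse - model_rmse) / baseline_rmse * 100
print(f"Model beats baseline by {improvement:.1f}%")  # prints 23.5%
```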