by: Paige Rackley
[Project Description] [Project Planning] [Data Dictionary] [Data Acquire and Prep] [Data Exploration] [Modeling] [Conclusion]
The goal of this project is to identify drivers of error by examining logerror. We want to predict logerror so that we can improve our current models. Better models let us serve Zillow's customers by giving them the data they request without error. We want our customers to trust us, and the best way to earn that trust is to give them accurate information.
- Use clustering algorithms to help determine predictors of logerror to help improve the performance of our current model.
- Use drivers of logerror to help improve our model of property values.
- Improve understanding of logerror to better inform the use of models for property prediction.
- My audience is the Zillow Data Science team.
- A final report notebook to be walked through during presentation.
- Notebooks used while working through data.
- Modules used during project to be used to replicate.
Our goal was to find drivers, form clusters, and test them for strong relationships with logerror. With those, we tested whether we could beat our baseline model. We did not beat the baseline.
- There is a relationship between yearbuilt and logerror.
- There is a relationship between county and logerror, but not specifically one county.
- There is a relationship between bedroomcnt and logerror.
- There was not a strong relationship between the clusters created.
Target | Datatype | Definition |
---|---|---|
logerror | float64 | Log Error |
Feature | Datatype | Definition |
---|---|---|
bedroomcnt | float64 | number of bedrooms |
bathroomcnt | float64 | number of bathrooms |
calculatedfinishedsquarefeet | float64 | total square feet of home |
county | object | county name (encoded from fips) |
latitude | float64 | latitude of home |
longitude | float64 | longitude of home |
lotsizesquarefeet | float64 | Square feet of lot |
propertylandusetypeid | float64 | Property Land Use type ID |
rawcensustractandblock | float64 | Raw Census |
regionidcounty | float64 | Region ID for county |
regionidzip | float64 | Region ID for zipcode |
yearbuilt | float64 | Year home was built |
taxvaluedollarcnt | float64 | Tax value total |
assessmentyear | float64 | Assessment Year |
In this step, I used SQL queries to pull what I wanted from Zillow's tables.
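A minimal sketch of what that acquisition step could look like. The table names follow the public Zillow/Kaggle schema and the connection helper is a hypothetical stand-in; credentials come from an env.py that is not shown here.

```python
import pandas as pd

def get_connection(db, user, host, password):
    """Build a MySQL connection URL for pandas.read_sql (credentials from env.py)."""
    return f"mysql+pymysql://{user}:{password}@{host}/{db}"

# Illustrative query: 2017 properties joined to their logerror predictions.
ZILLOW_QUERY = """
SELECT prop.*, pred.logerror
FROM properties_2017 AS prop
JOIN predictions_2017 AS pred USING (parcelid)
WHERE prop.propertylandusetypeid = 261;  -- single-family residential
"""

def acquire_zillow(user, host, password):
    """Run the query and return the raw dataframe."""
    url = get_connection("zillow", user, host, password)
    return pd.read_sql(ZILLOW_QUERY, url)
```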
In this step, I created multiple functions that were meant to help me prepare my data for both exploration and modeling.
handle_missing_values: How to handle missing values based on minimum percentage of values for rows and columns
wrangle_zillow: The wrangle function has the acquire and handle_missing_values nested in it. This function is to explore on independent variables, which will help us decide what to use for clustering later.
Steps implemented:
- Get rid of null values in my columns (lose a lot of bulk, nearly no data loss) and redundant columns.
- For the 'fips' column, I encode the fips codes to the corresponding counties (Los Angeles, Ventura, Orange) and rename the column to 'county' for readability.
- Removed outliers from several columns:
- Bathroom and bedroom counts limited to 1 - 5
- logerror limited to -0.31 to 0.5
- yearbuilt limited to 1910 or later
- calculatedfinishedsquarefeet limited to 650 - 5500
- taxvaluedollarcnt limited to 40000 - 300000
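The prep steps above can be sketched roughly as follows. The missing-value thresholds and the exact filtering code are assumptions; the ranges mirror the steps listed.

```python
import pandas as pd

def handle_missing_values(df, prop_required_column=0.6, prop_required_row=0.75):
    """Drop columns, then rows, missing more than the allowed share of values."""
    df = df.dropna(axis=1, thresh=int(prop_required_column * len(df)))
    df = df.dropna(axis=0, thresh=int(prop_required_row * df.shape[1]))
    return df

def label_county(df):
    """Encode fips codes as county names and rename the column to 'county'."""
    county_map = {6037.0: "Los Angeles", 6059.0: "Orange", 6111.0: "Ventura"}
    return df.assign(county=df.fips.map(county_map)).drop(columns="fips")

def remove_outliers(df):
    """Apply the hand-picked ranges from the steps above."""
    return df[
        df.bedroomcnt.between(1, 5)
        & df.bathroomcnt.between(1, 5)
        & df.logerror.between(-0.31, 0.5)
        & (df.yearbuilt >= 1910)
        & df.calculatedfinishedsquarefeet.between(650, 5500)
        & df.taxvaluedollarcnt.between(40_000, 300_000)
    ]
```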
split: This function splits the data into the 3 sets needed for exploring and statistical tests. I stratify on 'county' in this step.
scale_data: This function scales the 3 split data sets.
wrangle_split_scale: This function combines everything into one. We will do our clustering, testing, and modeling here.
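The split and scale helpers could look like the sketch below, assuming sklearn. The 60/20/20 proportions, random seed, and MinMaxScaler choice are illustrative; the stratification on 'county' follows the description above.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def split(df, strat_col="county"):
    """60/20/20 train/validate/test split, stratified on county."""
    train, test = train_test_split(df, test_size=0.2, random_state=123,
                                   stratify=df[strat_col])
    train, validate = train_test_split(train, test_size=0.25, random_state=123,
                                       stratify=train[strat_col])
    return train, validate, test

def scale_data(train, validate, test, cols):
    """Fit the scaler on train only, then transform all three sets."""
    scaler = MinMaxScaler().fit(train[cols])
    for part in (train, validate, test):
        part[cols] = scaler.transform(part[cols])
    return train, validate, test
```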
The Big Questions: Can clustering help us predict logerror? Can clustering help us beat the baseline?
Our target variable is logerror, so we will be comparing it to individual features as well as combinations of features (clusters).
For this Zillow project, since we would be using clustering, I wanted to focus on the major key features we have to work with and cluster features that are similar. I came up with three major themes:
- Land - refers to the house itself. The size, year it was built, how many rooms, etc.
- Location - refers to the geographic location of the home.
- Tax - refers to the taxes paid on the home.
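One way to build the three themed clusters is with sklearn's KMeans on the scaled features. The feature lists and k=4 below are illustrative assumptions, not the report's exact choices.

```python
from sklearn.cluster import KMeans

# Hypothetical feature groupings for the three themes described above.
THEMES = {
    "land_cluster": ["bedroomcnt", "bathroomcnt",
                     "calculatedfinishedsquarefeet", "yearbuilt"],
    "location_cluster": ["latitude", "longitude"],
    "tax_cluster": ["taxvaluedollarcnt"],
}

def add_clusters(train, k=4, seed=123):
    """Fit KMeans per theme and attach the cluster labels as new columns."""
    for name, cols in THEMES.items():
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        train[name] = km.fit_predict(train[cols])
    return train
```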
- After testing, all 3 clusters failed to reject the null hypothesis.
- None of the 3 produced significant results.
- For further exploration, it may be worth trying smaller clusters or different features.
Final Results:

| model | RMSE_train | RMSE_validate |
|---|---|---|
| baseline_mean | 0.088933 | 0.088828 |
| baseline_median | 0.089109 | 0.088993 |
| linear regression | 0.088677 | 0.088951 |
| LassoLars regression | 0.088933 | 0.088833 |
| Polynomial regression | 1.563197 | 1.739704 |
RMSE for the Polynomial model, degree=2, on test: 1.7995147682943546
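The model comparison above can be sketched as follows, assuming sklearn models and RMSE as the metric; the exact hyperparameters (e.g. the LassoLars alpha) are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return mean_squared_error(y_true, y_pred) ** 0.5

def compare_models(X_train, y_train, X_val, y_val):
    """Fit baseline, linear, LassoLars, and degree-2 polynomial; report RMSE."""
    results = {"baseline_mean": rmse(y_val, np.full(len(y_val), y_train.mean()))}
    for name, model in [("linear", LinearRegression()),
                        ("lassolars", LassoLars(alpha=1.0))]:
        model.fit(X_train, y_train)
        results[name] = rmse(y_val, model.predict(X_val))
    pf = PolynomialFeatures(degree=2)
    lm = LinearRegression().fit(pf.fit_transform(X_train), y_train)
    results["polynomial_d2"] = rmse(y_val, lm.predict(pf.transform(X_val)))
    return results
```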
- The Polynomial Regression model performs far worse than the baselines, LassoLars, and Linear regression.
- The unsupervised ML method of cluster models does not appear to be the best approach for predicting logerror.
- yearbuilt may be an indicator of logerror; however, this requires more investigation.
- Most important takeaway is that more time is needed to explore the data.
- Although some key drivers were found, and while they do have a relationship with logerror, many do not have a strong relationship with it.
- It would be beneficial to focus on logerror outliers.
- I would like to try classification models on the data. This may or may not beat the baseline model, but it could still bring in new takeaways.
- I would recommend continuing to improve upon the baseline model, as it works well enough for the current situation.
- I would recommend further identifying key drivers of logerror to potentially construct more accurate predictors.
- Consider creating models for each cluster; this may help narrow down other variables.
How to Reproduce
- Read this README.md
- Download modules into your working directory.
- Create a .gitignore for env.py since it contains confidential info such as username and password to access SQL databases.
- Have fun doing your own exploring, modeling, and more!