As a junior data scientist, I was given the task of finding what drives logerror in the 2017 housing data. I started with a question: what is logerror? I arrived at the formula
logerror = log(Zestimate) − log(ActualSalePrice)
and off to the data I went. I followed my MVP steps and the steps to reproduce, and spent the last 5 days digging and cleaning to uncover any new information I could. I created a model that performed 0.0003% better than baseline. I kept thinking and working until my deadline approached.
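To make the definition concrete, here is a minimal sketch of the logerror calculation. It assumes natural logarithms (the README only says "log"), and the function name and prices are hypothetical, chosen for illustration.

```python
import math


def logerror(zestimate: float, actual_sale_price: float) -> float:
    """Return log(Zestimate) - log(ActualSalePrice), assuming natural logs."""
    if zestimate <= 0 or actual_sale_price <= 0:
        raise ValueError("both prices must be positive")
    return math.log(zestimate) - math.log(actual_sale_price)


# Hypothetical prices, for illustration only.
over = logerror(330_000, 300_000)   # Zestimate above the sale price: positive
under = logerror(270_000, 300_000)  # Zestimate below the sale price: negative
exact = logerror(300_000, 300_000)  # Zestimate equal to the sale price: zero
```

A positive logerror means the Zestimate was too high, and a negative one means it was too low.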
Events in sequence
- Import from the Codeup SQL database, using your own login information.
- Acquire Data
- Clean, Prep and Split Data
- Explore Data
- Hypothesis Testing
- Evaluation of Data
- Modeling
- MVP, Identify Baseline
- Train and Validate
- Test Our Best Model
- Conclusion and Recommendations
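The "Clean, Prep and Split Data" step can be sketched as below. This is only an illustration: the 60/20/20 proportions, the seed, and the function name `split_rows` are assumptions, not values taken from the project.

```python
import random


def split_rows(rows, validate_frac=0.2, test_frac=0.2, seed=123):
    """Shuffle rows and return (train, validate, test) lists.

    The fractions and seed are illustrative assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_validate = int(len(shuffled) * validate_frac)
    test = shuffled[:n_test]
    validate = shuffled[n_test:n_test + n_validate]
    train = shuffled[n_test + n_validate:]
    return train, validate, test


# Stand-in for 100 cleaned home records.
train, validate, test = split_rows(range(100))
```

Exploration and model fitting would then use only `train`, with `validate` for comparing models and `test` held back for the single best model.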
Data dictionary
- LA - one of the dummy variables I created for county
- Orange - one of the dummy variables I created for county
- Ventura - one of the dummy variables I created for county
- fips - the FIPS code for the county
- latitude - map coordinates for the counties
- longitude - map coordinates for the counties
- sqft - house square feet
- lot_sqft - area of the lot around the house, in square feet
- zip_code - the zip code for the houses in the counties
- property_quality - how the house holds up in terms of quality
- home_age - how long it has been since the house was built
- logerror - logerror = log(Zestimate) − log(ActualSalePrice)
- structure_value - the value of the home or structure itself
- bedrooms - the number of bedrooms
- bathrooms - the number of bathrooms
- land_value - the value of the land, in dollars
- structure_dollar_per_sqft - the mean dollar value of the house per square foot
- land_dollar_per_sqft - the mean dollar value of the land around the house per square foot
- bed_bath_ratio - the ratio of bedrooms to bathrooms, with outliers removed
- avgqualityavgage - a home of average quality and average age
- poor_quality_old_age - a poor-quality, older home
- avq_quality_young_age - an average-quality, younger home
- avg_quality_old_age - an average-quality, older home
- bestest - the best-tested of my quality and age clusters
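Several of the columns above are derived from others. The sketch below shows one plausible way to build them for a single home record. The formulas are assumptions inferred from the column names, `year_built` is a hypothetical input column, and the example values are made up.

```python
# FIPS codes as listed in this README.
FIPS_TO_COUNTY = {6037: "LA", 6059: "Orange", 6111: "Ventura"}


def derive_features(home: dict, current_year: int = 2017) -> dict:
    """Return a copy of one home record with derived columns added."""
    out = dict(home)
    # One dummy (0/1) column per county, keyed off the fips code.
    for code, county in FIPS_TO_COUNTY.items():
        out[county] = int(home["fips"] == code)
    # year_built is a hypothetical input column name.
    out["home_age"] = current_year - home["year_built"]
    # Assumed formulas: value divided by the matching area.
    out["structure_dollar_per_sqft"] = home["structure_value"] / home["sqft"]
    out["land_dollar_per_sqft"] = home["land_value"] / home["lot_sqft"]
    out["bed_bath_ratio"] = home["bedrooms"] / home["bathrooms"]
    return out


# Made-up home, for illustration only.
example = derive_features({
    "fips": 6059, "year_built": 1987, "sqft": 2000, "lot_sqft": 5000,
    "structure_value": 300_000, "land_value": 200_000,
    "bedrooms": 3, "bathrooms": 2,
})
```

Homes with zero bathrooms or zero square feet would need to be dropped or handled before this step, since the ratios divide by those values.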
Executive Summary:
Project Goals - Identify the drivers of error in the Zestimate in order to improve the accuracy of home value predictions, with the help of ML and clustering models.
logerror = log(Zestimate) − log(ActualSalePrice)
In this presentation I perform cluster analysis on the 2017 logerror values to help predict future home prices. I also search for the key drivers of logerror, which turned out to be:
- sqft
- lot_sqft
- bedrooms
- bathrooms
- structure_dollar_per_sqft
- land_dollar_per_sqft
- poor_quality_old_age
- avq_quality_young_age
- longitude
I created an OLS regression model with a 0.0003% improvement over my baseline, so as a data scientist I would recommend further analysis with my model.
BASELINE:
RMSE using Median
Train/In-Sample: 0.164122
Validate/Out-of-Sample: 0.166928
RMSE for OLS using LinearRegression
Test/Out-of-Sample Performance: 0.161775
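For readers unfamiliar with a median baseline, the sketch below shows how such an RMSE can be computed: every home is predicted to have the training median logerror, and that constant prediction is scored. The function names and logerror values are made up for illustration and are not the project's data or code.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def median_baseline_rmse(train_y, eval_y):
    """Predict the training median for every row of eval_y and score it."""
    baseline = statistics.median(train_y)
    return rmse(eval_y, [baseline] * len(eval_y))


# Made-up logerror values, for illustration only.
train_y = [-0.10, 0.00, 0.02, 0.05, 0.30]
validate_y = [-0.05, 0.02, 0.10]
train_rmse = median_baseline_rmse(train_y, train_y)        # in-sample
validate_rmse = median_baseline_rmse(train_y, validate_y)  # out-of-sample
```

The same `rmse` function can score any model's predictions, so a model beats the baseline when its RMSE on the same rows is lower.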
FIPS
- Los Angeles County, California (6037)
- Orange County, California (6059)
- Ventura County, California (6111)
Hypotheses
1. Fail to reject the null hypothesis: home_age and logerror.
There is a linear relationship, although it is a weak negative one.
2. Reject the null hypothesis: no correlation between lot_sqft and logerror.
There is a linear relationship, although it is a weak positive one.
3. Fail to reject the null hypothesis: no correlation between home_value and logerror.
- LA: 0.014516765820273388
- Orange: 0.01786707488534417
- Ventura: 0.013923148212340804
- All three counties rejected the null hypothesis.
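A correlation hypothesis like the ones above can be checked with a sketch like the following. It computes Pearson's r and then uses a permutation test as a dependency-free stand-in for the p-value; this is not necessarily the test used in the project. The significance level of 0.05, the function names, and the data are all assumptions for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(x, y, n_permutations=2000, seed=123):
    """Two-sided p-value for H0: no correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Made-up values, for illustration only.
lot_sqft = [3000, 4500, 5000, 6200, 7500, 8000, 9100, 10500]
logerrors = [0.01, 0.02, 0.02, 0.03, 0.05, 0.04, 0.06, 0.07]

alpha = 0.05  # assumed significance level
r = pearson_r(lot_sqft, logerrors)
p = permutation_p_value(lot_sqft, logerrors)
decision = "reject H0" if p < alpha else "fail to reject H0"
```

The sign of `r` gives the direction of the relationship and its size gives the strength, while `p` decides whether the null hypothesis of no correlation is rejected.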