classification-project

The repository contains the files for Daniel Northcutt and Jared Godar's Codeup project on clustering and modeling of zillow real estate data to predict error estimates

About the Project

Project Goals

The main goal of this project is to be able to accurately predict the Zillow Zestimate error.

This will be accomplished by using past using property data from transactions of single family homes in 2017 and clustering of data to build at least four models, evaluating the effectiveness of each model, and testing the best model on new data is has never seen.

The ability to accurately value a home is essential for both buyers and sellers. The ability of us to predict error in zestimates will allow us to determine the major drivers of error then improve our estimates accordingly. Having the most accurate estimates possible is at the core of our business.

Project Description

This project provides the opportunity to create and evaluate multiple predictive models as well as implement other essential parts of the data science pipeline including data cleaning, imputing nulls, and clustering data to look for trends in subgroups.

It will involve pulling relevant data from a SQL database; cleaning that data; splitting the data into training, validation, and test sets; scaling data; feature engineering; exploratory data analysis; clustering; modeling; model evaluation; model testing; and effectively communicating findings in written and oral formats.

A home is often the most expensive purchase one makes in their lifetime. Having a good handle on pricing is essential for both buyers and sellers. An accurate pricing model factoring in the properties of similar homes will allow for appropriate prices to be set as well as the ability to identify under and overvalued homes. By determining drivers of error in our models, we can develop strategies to improve the models.

Initial Questions

What are the main drivers of estimate error?
What are the relative importance of the assorted drivers?
What factors reduce error?
What factors don't matter?
Are there any other potentially useful features that can be engineered from the current data available?
Are the relationships suggested by initial visualizations statistically significant?
Is the data balanced or unbalanced?
Are there null values or missing data that must be addressed?
Are there any duplicates in the dataset?
Which model feature is most important for this data and business case?
Which model evaluation metrics are most sensitive to this primary feature?

Data Dictionary

Feature	Datatype	Description
bathroomcnt	float64	bathroom count
bedroomcnt	float64	bedroom count
calculatedfinishedsquarefeet	float64	calc finished square feet
fips	float64	fips (county)
latitude	float64	latitude
longitude	float64	longitude
lotsizesquarefeet	float64	lot size square feet
regionidcity	float64	city region
regionidcounty	float64	city county
regionidzip	float64	city zip
yearbuilt	float64	year built
structuretaxvaluedollarcnt	float64	structure tax value count
taxvaluedollarcnt	float64	tax value count
landtaxvaluedollarcnt	float64	land value count
taxamount	float64	tax amount
county	object	county name
age	float64	age of home
age_bin	float64	age bins of homes
taxrate	float64	tax rate
acres	float64	acres
acres_bin	float64	acre bin of homes
sqft_bin	float64	squarefoot bin of homes
structure_dollar_per_sqft	float64	structure dollar per sqft
structure_dollar_sqft_bin	float64	structure dollar per sqft bins
land_dollar_per_sqft	float64	land dollar per sqft
lot_dollar_sqft_bin	float64	lot dollar per sqft
bath_bed_ratio	float64	bath to bed ratio
cola	int64	cola
logerror_bins	category	bin of logerror
baseline	float64	baseline target to beat
scaled_latitude	float64	scaled latitude
scaled_longitude	float64	scaled longitude
scaled_bathroomcnt	float64	scaled bathroom count
scaled_taxrate	float64	scaled tax rate
scaled_bedroomcnt	float64	scaled bedroom count
scaled_lotsizesquarefeet	float64	scaled lot size sqft
scaled_age	float64	scaled age of homes
scaled_acres	float64	scaled acres
scaled_bath_bed_ratio	float64	scaled bath bed ratio
scaled_calculatedfinishedsquarefeet	float64	scaled calc sqft
area_cluster	object	area cluster grouping
size_cluster	object	size cluster grouping
price_cluster	object	price cluster grouping
tax_cluster	object	tax cluster grouping
area_cluster_la_newer	uint8	la newer area cluster
area_cluster_la_older	uint8	la older area cluster
area_cluster_northwest_costal	uint8	northwest costal area cluster
area_cluster_palmdale_landcaster	uint8	palmdale-landcaster area cluster
area_cluster_santa_clarita	uint8	santa clarita area cluster
area_cluster_se_coast	uint8	se coast area cluster
size_cluster_1250_to_1650	uint8	size cluster
size_cluster_1300_to_2000	uint8	size cluster
size_cluster_1500_to_1900	uint8	size cluster
size_cluster_1500_to_2800	uint8	size cluster
size_cluster_2300_to_4400	uint8	size cluster
size_cluster_2900_to_4000	uint8	size cluster
size_cluster_900_to_1200	uint8	size cluster
price_cluster_144000_to_355000	uint8	price cluster
price_cluster_34000_to_110000	uint8	price cluster
price_cluster_420000_to_870000	uint8	price cluster
price_cluster_45000_to_173000	uint8	price cluster
price_cluster_69000_to_210000	uint8	price cluster
tax_cluster_1000_to_3000	uint8	tax cluster
tax_cluster_16000_to_22000	uint8	tax cluster
tax_cluster_30000_to_40000	uint8	tax cluster
tax_cluster_5000_to_6000	uint8	tax cluster
tax_cluster_8500_to_12000	uint8	tax cluster
logerror	float64	log error - target variable

Steps to Reproduce

You will need your own env file with database credentials along with all the necessary files listed below to run my final project notebook.

Read this README.md.
Download the wrangle4.py, explore.py, and modelipynb files into your working directory.
Add your own env file to your directory. (user, password, host).
Run the report_notebook2.ipynb workbook.

The Plan

Acquire, clean, prepare, and split the data:
- Pull from Zillow database.
- Eliminate any unnecessary or redundant fields.
- Engineer new, potentially informative features.
- Search for null values and respond appropriately (delete, impute, etc.).
- Deal with outliers.
- Scale data appropriately.
- Divide the data in to training, validation, and testing sets (~50-30-20 splits)
Exploratory data analysis:
- Visualize pairwise relationships looking for correlation with home value.
- Note any interesting correlations or other findings.
- Test presumptive relationships for statistical significance.
- Think of what features would be most useful for model.
- Employ clustering to look for relationships between specific sub-groups.
- Record any other interesting observations or findings. NOTE: This data analysis will be limited to the training dataset
Model generation, assessment, and optimization:
- Establish baseline performance (mean model error, assuming error is normally distributed).
- Generate a basic regression model using only strongest drivers.
- Calculate evaluation metrics to assess quality of models (RMSE, R^2, and p as primary metrics).
- Generate additional models incorporating other existing fields.
- Use k-best and recursive feature selection to determine features.
- Engineer additional features to use in other models.
- Evaluate ensemble of better models on validation data to look for overfitting.
- Select the highest performing model.
- Test that model with the previously unused and unseen test data once and only once.
Streamline presentation
- Take only the most relative information from the working along and create a succinct report that walks through the rationale, steps, code, and observations for the entire data science pipeline of acquiring, cleaning, preparing, modeling, evaluating, and testing our model.
- Run the explore.py code to run data exploratory analysis of the data thru statistic tests and clustering visualizations
- Run the model.py to run the 5 different OLS models for the train and validate datasets followed by running on the testing dataset
** Next Steps/Recommendations:
- Thoughtfully impute missing values from the dataset to give a richer training set to explore
- Modeling can be tested on other linear regressors (LassoLars, TweedieRegressor, and Poylnomial regression tuning hyperparameters)
- Employ cross-validation techniques while tunning additional models

Key Findings

Most important factors:
- By incorporating cluster data, our model improved baseline performance nearly 3%.
- Clustering provided a greater understanding of our data and allowed modeling that beat the baseline.
- Features showed very low correlation with the target variable.
- Using scaling, binning, and clustering gave the data a foundation for a stronger model.
Model performance
- Model 5 of 22 features (geoclusters, size clusters, price clusters, taxvaluedollarcnt,
  structuretaxvaluedollarcnt, landtaxvaluedollarcnt, taxamount) was the strongest performing model
- Tested dataset performed 3% better than the baseline RMSE
- All models performed were OLS

Zillow-Project-JG-DN/clustering-project-jgdn