Real Estate Price Prediction - California

Predicting real estate prices in California using a linear regression model, with polynomial features to capture non-linear relationships.

Build the "Best Model" for predicting real estate prices in California.
Interpret the model using scikit-learn's built-in permutation_importance function. See the scikit-learn user guide on permutation_importance for details.
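
As a rough illustration of how permutation_importance is called, here is a minimal sketch on synthetic stand-in data (not the California dataset). The feature layout, coefficients, and n_repeats value are assumptions chosen for the example; the notebook's actual settings may differ.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and record the drop in score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Rank features from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(
        f"feature_{idx}: {result.importances_mean[idx]:.3f} "
        f"+/- {result.importances_std[idx]:.3f}"
    )
```

A feature whose shuffling barely changes the score contributes little to the fitted model's predictions, which is how the rankings in the conclusion below should be read.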

Split the data into train and test sets, with y = median_house_value as the target
Calculate baseline predictions
Examine the correlations
Measure multicollinearity using the variance inflation factor (VIF) and drop features with high multicollinearity
Create the Transformers:
- SimpleImputer: imputes missing values with the median for numerical columns and the most frequent value (most_frequent) for categorical columns
- PolynomialFeatures of varying degree in range 1 to 5
- One-hot encoder for categorical feature "ocean_proximity"
- KMeans clustering for longitude and latitude (10 clusters)
- Pass through any remaining columns (remainder='passthrough')
Create the Pipeline from the transformers above and LinearRegression
Fit / predict / calculate MSEs
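
The steps above can be sketched roughly as follows. This is a minimal sketch on synthetic stand-in data, not the notebook's code: the column subset, the choice to apply PolynomialFeatures only to median_income, the use of KMeans distances-to-centroid as geographic features, and the helper name build_pipeline are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Synthetic stand-in for the housing data (assumed columns and values).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "longitude": rng.uniform(-124, -114, n),
    "latitude": rng.uniform(32, 42, n),
    "median_income": rng.uniform(1, 10, n),
    "housing_median_age": rng.integers(1, 52, n).astype(float),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
df["median_house_value"] = (
    40_000 * df["median_income"]
    + 2_000 * df["median_income"] ** 2
    + rng.normal(scale=10_000, size=n)
)
# Knock out a few values so the imputer has something to do.
df.loc[df.sample(frac=0.05, random_state=0).index, "median_income"] = np.nan

X = df.drop(columns="median_house_value")
y = df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predict the mean of the training target.
baseline_mse = mean_squared_error(
    y_test, np.full(len(y_test), y_train.mean())
)


def build_pipeline(degree):
    """Hypothetical helper: preprocessing plus LinearRegression."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    # KMeans.transform returns the distance to each of the 10 centroids.
    geo = KMeans(n_clusters=10, n_init=10, random_state=42)
    preprocess = ColumnTransformer(
        [
            ("num", numeric, ["median_income"]),
            ("cat", categorical, ["ocean_proximity"]),
            ("geo", geo, ["longitude", "latitude"]),
        ],
        remainder="passthrough",  # housing_median_age passes through as-is
    )
    return Pipeline([("preprocess", preprocess), ("model", LinearRegression())])


# Fit one pipeline per polynomial degree and record the test MSE.
mses = {}
for degree in range(1, 6):
    pipe = build_pipeline(degree)
    pipe.fit(X_train, y_train)
    mses[degree] = mean_squared_error(y_test, pipe.predict(X_test))

best_degree = min(mses, key=mses.get)
print(f"baseline MSE: {baseline_mse:.3e}")
print(f"best degree: {best_degree}, test MSE: {mses[best_degree]:.3e}")
```

Comparing each degree's test MSE against the baseline is what identifies the "best" model; selecting the degree on a separate validation split or with cross-validation would avoid tuning against the test set.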

Conclusion:
- Geographic location (longitude and latitude) of the housing units has a significant influence on the target variable
- Population and total bedrooms also have relatively high permutation importance compared to the rest of the features
- Median income has a moderate permutation importance
- Housing median age, households, ocean proximity, and total rooms have relatively low permutation importance

Repository Structure

data/: Contains dataset used in the analysis.
notebooks/real-estate-price-prediction.ipynb: Jupyter notebook with code for data analysis.
README.md: Summary of findings and link to notebook

The detailed analysis and code can be found in the Jupyter notebook at notebooks/real-estate-price-prediction.ipynb.