Predicting Real Estate Prices in California Using a Linear Regression Model with Non-Linear Features
- Build the best model for predicting real estate prices in California.
- Interpret the model using scikit-learn's built-in permutation_importance function. Take a look at the user guide for permutation_importance here.
- Split the data into train and test sets and assign the target y = median_house_value
- Calculate baseline predictions
- Examine the Correlations
- Measure multicollinearity using the built-in variance inflation factor (VIF) function and drop features with high multicollinearity
- Create the Transformers:
  - SimpleImputer: impute missing values with the median for numerical columns and the most frequent value (most_frequent) for categorical columns
  - PolynomialFeatures with the degree varied from 1 to 5
  - One-hot encoder for the categorical feature "ocean_proximity"
  - KMeans clustering for longitude and latitude (10 clusters)
  - Pass through any remaining columns (remainder='passthrough')
- Create the Pipeline from the transformers above and LinearRegression
- Fit, predict, and calculate the MSE for each polynomial degree
- Result: the best polynomial degree is 3, which gives the smallest MSE (3.9M)
- Plot the model predictions vs. actual values for degree = 3
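
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the data frame is synthetic, the choice to apply PolynomialFeatures only to `median_income` is an assumption (the list does not say which columns are expanded), and the helper name `make_pipeline_for_degree` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Synthetic stand-in for the housing data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "longitude": rng.uniform(-124, -114, n),
    "latitude": rng.uniform(32, 42, n),
    "median_income": rng.uniform(1, 10, n),
    "housing_median_age": rng.uniform(1, 52, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
df["median_house_value"] = (
    40_000 * df["median_income"]
    + 2_000 * df["median_income"] ** 2
    + rng.normal(0, 10_000, n)
)
# Knock out a few values so the imputer has something to do.
df.loc[rng.choice(n, 20, replace=False), "median_income"] = np.nan

# Train/test split with y = median_house_value.
X = df.drop(columns="median_house_value")
y = df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: always predict the mean of the training target.
baseline_mse = mean_squared_error(y_test, np.full(len(y_test), y_train.mean()))


def make_pipeline_for_degree(degree):
    """Hypothetical helper: preprocessing + LinearRegression for one degree."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    # KMeans.transform returns the distance to each of the 10 cluster centers.
    geo = KMeans(n_clusters=10, n_init=10, random_state=42)
    preprocess = ColumnTransformer(
        [
            ("num", numeric, ["median_income"]),
            ("cat", categorical, ["ocean_proximity"]),
            ("geo", geo, ["longitude", "latitude"]),
        ],
        remainder="passthrough",  # housing_median_age passes through unchanged
    )
    return Pipeline([("preprocess", preprocess), ("model", LinearRegression())])


# Fit one pipeline per degree and record the test MSE.
mses = {}
for degree in range(1, 6):
    model = make_pipeline_for_degree(degree).fit(X_train, y_train)
    mses[degree] = mean_squared_error(y_test, model.predict(X_test))

best_degree = min(mses, key=mses.get)
print(f"baseline MSE: {baseline_mse:.3g}")
print(f"best degree: {best_degree}, MSE: {mses[best_degree]:.3g}")
```

On the synthetic data the selected degree will not necessarily match the result reported for the real dataset; the point is only the shape of the pipeline and the degree sweep.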
- Conclusion:
  - Geographic location (longitude and latitude) of the housing units has a significant influence on the target variable
  - Population and total bedrooms also have relatively high permutation importance compared to the rest of the features
  - Median income has a moderate permutation importance
  - Housing Median Age, Households, Ocean Proximity, Total Rooms have relatively low permutation importance
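
For readers unfamiliar with how such a ranking is produced, here is a small sketch of `sklearn.inspection.permutation_importance` on a fitted model. The data is synthetic and the `noise` column is a hypothetical feature added only for illustration; the notebook presumably applies the same call to its fitted pipeline and the real test set.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): one strong feature, one weak, one irrelevant.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "median_income": rng.normal(size=n),
    "housing_median_age": rng.normal(size=n),
    "noise": rng.normal(size=n),  # hypothetical column with no effect on y
})
y = (
    5.0 * X["median_income"]
    + 0.5 * X["housing_median_age"]
    + rng.normal(scale=0.1, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Shuffle each column in turn on held-out data and measure how much the
# score degrades; a larger drop means the model relies more on that feature.
result = permutation_importance(
    model,
    X_test,
    y_test,
    scoring="neg_mean_squared_error",
    n_repeats=10,
    random_state=0,
)

ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)
print(ranking)
```

Computing the importances on the test split rather than the training split shows which features matter for generalization, not just for fitting.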
- data/: Contains the dataset used in the analysis.
- notebooks/real-estate-price-prediction.ipynb: Jupyter notebook with code for data analysis.
- README.md: Summary of findings and link to notebook.
The detailed analysis and code can be found in the Jupyter notebook here.