Primary language: Jupyter Notebook. License: MIT.

Real Estate Price Prediction - California

Predicting real estate prices in California using a linear regression model on non-linear (polynomial) features.

Objective

  • Build the "best model" for predicting real estate prices in California.
  • Interpret the model using permutation_importance, a built-in scikit-learn function. Take a look at the user guide for permutation_importance here.

Strategy for Building the "Best Model"

  • Split the data into train and test sets, with y = median_house_value as the target
  • Calculate baseline predictions
  • Examine the correlations
  • Measure multicollinearity using the variance inflation factor (VIF) and drop features with high multicollinearity
  • Create the Transformers:
    • SimpleImputer: impute missing values with the median for numerical columns and with the most frequent value (most_frequent) for categorical columns
    • PolynomialFeatures with degree varying from 1 to 5
    • One-hot encoder for categorical feature "ocean_proximity"
    • KMeans clustering for longitude and latitude (10 clusters)
    • Pass through any remaining columns (remainder='passthrough')
  • Create the Pipeline from the transformers above and LinearRegression
  • Fit, predict, and calculate the MSEs
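
The baseline and multicollinearity steps in the list above can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the notebook's code. It assumes the baseline predicts the training-set mean (the source does not say which statistic is used), and `baseline_mse` and `vif` are hypothetical helper names that reimplement VIF as 1 / (1 - R²), with an intercept in each auxiliary regression.

```python
import numpy as np


def baseline_mse(y_train, y_test):
    """MSE of always predicting the training-set mean (assumed baseline)."""
    pred = np.full(len(y_test), np.mean(y_train))
    return float(np.mean((np.asarray(y_test, dtype=float) - pred) ** 2))


def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2_j),
    where R^2_j comes from regressing column j on all other columns."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column to the auxiliary regression.
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return np.array(factors)


# Synthetic features: `c` is almost a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
```

Here `a` and `c` receive large VIF values because each is almost fully explained by the other, while `b` stays close to 1. A common rule of thumb is to drop features whose VIF exceeds a threshold such as 5 or 10; the threshold used in the notebook is not stated.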

Pipeline

image
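
A pipeline of this shape could be assembled roughly as follows. This is a sketch under stated assumptions rather than the notebook's actual code: `build_pipeline` is a hypothetical helper, the choice of which columns receive polynomial features is assumed, and the data frame is small synthetic data that only borrows column names from the features discussed in this README.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures


def build_pipeline(degree, poly_cols, cat_cols, geo_cols, n_clusters=10):
    """Hypothetical helper: preprocessing + LinearRegression."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    pre = ColumnTransformer(
        [
            ("num", numeric, poly_cols),
            ("cat", categorical, cat_cols),
            # KMeans.transform yields distances to each cluster center.
            ("geo", KMeans(n_clusters=n_clusters, n_init=10, random_state=0), geo_cols),
        ],
        remainder="passthrough",  # untouched columns are passed through
        sparse_threshold=0,       # always return a dense array
    )
    return Pipeline([("pre", pre), ("reg", LinearRegression())])


# Synthetic stand-in for the housing data (values are made up).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "longitude": rng.uniform(-124, -114, n),
    "latitude": rng.uniform(32, 42, n),
    "median_income": rng.uniform(1, 10, n),
    "total_bedrooms": rng.uniform(100, 1000, n),
    "housing_median_age": rng.uniform(1, 52, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
df.loc[::25, "total_bedrooms"] = np.nan  # inject some missing values
y = 50_000 * df["median_income"] + rng.normal(0, 10_000, n)

model = build_pipeline(
    degree=2,
    poly_cols=["median_income", "total_bedrooms"],
    cat_cols=["ocean_proximity"],
    geo_cols=["longitude", "latitude"],
)
model.fit(df, y)
preds = model.predict(df)
n_features = model[:-1].transform(df).shape[1]
```

Using KMeans as a transformer replaces raw longitude and latitude with distances to the 10 cluster centers, which lets a linear model pick up regional effects that a straight line in the coordinates cannot express.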

Training and Testing MSEs vs Polynomial Degree

image

KMeans Clusters

image

Optimal Model Complexity using Simple Cross Validation

  • Result: the best polynomial degree is 3, with the smallest MSE = 3.9M
  • Plot of model predictions vs. actual values for degree = 3:
image
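
The degree selection described in this section can be illustrated with a simple hold-out loop. The sketch below uses a single synthetic feature with a cubic relationship, so the numbers it produces are unrelated to the 3.9M figure above; it only shows the mechanics of comparing training and test MSE across degrees 1 to 5.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a cubic signal plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 3 - 2 * X[:, 0] + rng.normal(0, 1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

train_mse, test_mse = {}, {}
for degree in range(1, 6):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))

# Pick the degree with the lowest error on the held-out set.
best_degree = min(test_mse, key=test_mse.get)
```

Training MSE can only go down as the degree grows, so the held-out MSE is the one to select on: it stops improving, or gets worse, once the extra terms start fitting noise.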

Interpreting the Model using Permutation Importance

  • Conclusion:
    • Geographic location (longitude and latitude) of the housing units has a significant influence on the target variable
    • Population and total bedrooms also have relatively high permutation importance compared to the rest of the features
    • Median income has a moderate permutation importance
    • Housing median age, households, ocean proximity, and total rooms have relatively low permutation importance
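
For readers unfamiliar with the technique, here is a minimal sketch of how scikit-learn's `permutation_importance` produces such a ranking. It uses synthetic data in which the first feature matters most and the third not at all; it does not reproduce the rankings reported above.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: feature 0 is strong, feature 1 is weak, feature 2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in the model's score
# (R^2 by default for regressors) on held-out data.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Feature indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
```

Each importance is the average score drop over `n_repeats` shuffles, so the values describe what this fitted model relies on, not necessarily the true relationship in the data. Strongly correlated features can share or mask each other's importance.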

Repository Structure

  • data/: Contains dataset used in the analysis.
  • notebooks/real-estate-price-prediction.ipynb: Jupyter notebook with code for data analysis.
  • README.md: Summary of findings and link to the notebook.

Notebook

The detailed analysis and code can be found in the Jupyter notebook here.