/Boston-House-Price-Prediction_case_study

A case study on the prediction modelling of the house price of Boston using various machine learning techniques

Primary LanguageHTML

Boston House Prices Prediction: A case study

Solo project: by Anurima Dey

Purpose:

I started my journey of understanding the machine learning algorithms using this project. In the confusion of which topic to choose, "clssification!", "regression?", "classification??", "regression!!", I thought why not start with the topic whose starting alphabet has a higher ASCII value, which is "R", i.e. regression....(Kudos to Random logic :'D). Well but the main purpose was to understand a regression problem (i.e when we are predicting a continuous variable) and applying machine learning techniques to obtain the objective i.e. a decent accuracy. The continuous variable here is the MEDV (median value of the owner occupied homes). This BHPP is also a beginners project to deploy a data analyst career, as enlisted by Kaggle.

Objective:

Let be precise:

  1. Obtaining and cleaning the data.
  2. Data Exploration through data Visualization, popularly termed as Exploratory Data Analysis (EDA).
  3. Optimal and necessary Feature Selection and data partitioning (Train and Test)
  4. Fitting various regression models, and Random Forest model with grid search, and cross validation.
  5. Have also applied PCA to check if it increses accuracy.
  6. The error matrices used to check the accuracy was RMSE, MAPE, MAE, MSE.
  7. Tried to create self help plotting packages available in R_utils repository

Some visulization for the dataset:

The dataset is obtained from Kaggle stored in here

The Description of the dataset:

The Histogram of MEDV

Correlation between various attributes

Procedure:

I have mainly deployed the problem using three different linear regression models with repeated cross validation for model training.

Model 1:

$MEDV= \alpha + \beta X + \epsilon$ where
$X$: the matrix containing all the independent variables,
$Y$: the dependant variable MEDV.
$\epsilon$ : residual error.

Model 2:

$log(MEDV)= \alpha+ \beta X + \epsilon$

Model 3:

$log(MEDV)=\alpha+\beta X+\epsilon, where \ X \ is \ a \ matrix \ of \ ~ DIS+RAD+LSTAT+NOX+INDUS+CRIM+TAX $