The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It is a modernized and expanded alternative to the often-cited Boston Housing dataset. This project was started as motivation for learning machine learning algorithms and data preprocessing techniques such as Exploratory Data Analysis, Feature Engineering, Feature Selection, and Feature Scaling, and finally to build a machine learning model that predicts house prices in Boston.
The data was originally published by Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978. The dataset was collected from Kaggle. Let's get into the data and learn more about it.
- Origin
- The origin of the boston housing data is Natural.
- Usage
- This dataset may be used for Assessment.
- Number of Cases
- The dataset contains a total of 506 cases.
- Order
- The order of the cases is mysterious.
- Variables
There are 14 attributes in each case of the dataset. They are:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS - proportion of non-retail business acres per town
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT - percentage of lower-status population
- MEDV - median value of owner-occupied homes in $1000s
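The layout of these 14 attributes can be sketched as follows. The `housing.csv` filename is an assumption (use whatever path the Kaggle download has), and the values below are synthetic placeholders purely to illustrate the shape of the frame:

```python
import numpy as np
import pandas as pd

cols = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
        "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# In the project the data comes from the Kaggle CSV, e.g.
#   df = pd.read_csv("housing.csv")   # filename is an assumption
# Synthetic values stand in here just to illustrate the layout.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((506, len(cols))), columns=cols)

print(df.shape)   # one row per case: (506, 14)
```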
Before modeling, we will pre-process the dataset through the following steps:
- Finding the correlation between the predictor variables
  - A. Correlation matrix between SalePrice and the other variables
  - B. SalePrice correlation matrix
- Finding missing values and imputing them, using K-Means if necessary
  - A. Computing the percent of missing values
  - B. Plotting the proportion of missing values
- Performing outlier detection, to remove values that can decrease model accuracy and lead to inappropriate predictions
  - A. Univariate analysis
- Analysing the target variable, SalePrice
- Checking the correlation of the target variable with the predictor variables to handle multi-collinearity, and checking the skewness of 'GrLivArea' and 'TotalBsmtSF'
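The correlation step (A and B above) can be sketched as below. The frame here is synthetic, with feature names borrowed from the text ('GrLivArea', 'TotalBsmtSF', 'SalePrice'); in the project `df` would be the actual training set:

```python
import numpy as np
import pandas as pd

# Illustrative frame; in the project `df` is the Kaggle training set.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, n),
    "TotalBsmtSF": rng.normal(1000, 300, n),
    "OverallQual": rng.integers(1, 11, n).astype(float),
})
# SalePrice is driven by GrLivArea plus noise, so a strong
# correlation should surface in the ranking below.
df["SalePrice"] = 50 * df["GrLivArea"] + rng.normal(0, 5000, n)

# A. Correlation matrix between SalePrice and the other variables
corr = df.corr()

# B. SalePrice correlations: variables ranked by absolute
#    correlation with the target.
top = corr["SalePrice"].abs().sort_values(ascending=False)
print(top)
```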
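The missing-value audit (computing and plotting the percent missing) might look like this minimal sketch; gaps are injected into a synthetic column here, and the K-Means imputation itself is left out since it is only applied "if necessary":

```python
import numpy as np
import pandas as pd

# Illustrative frame with injected gaps; in the project `df` is the
# Kaggle training set and the column names are assumptions.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((100, 3)),
                  columns=["LotFrontage", "GarageArea", "SalePrice"])
df.loc[df.sample(frac=0.2, random_state=1).index, "LotFrontage"] = np.nan

# A. Percent of missing values per column, largest first.
pct_missing = df.isna().mean().sort_values(ascending=False) * 100
print(pct_missing)

# B. The proportions can then be plotted, e.g. pct_missing.plot.bar()
```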
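Univariate outlier detection and the skewness check on 'GrLivArea' and 'TotalBsmtSF' can be sketched together. The log-normal draws below are synthetic stand-ins that mimic the right skew these area variables typically show, and the log1p transform is one common way (an assumption, not necessarily the project's exact choice) to reduce it:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Illustrative frame; log-normal draws mimic the right skew that
# area variables such as GrLivArea show in the real data.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "GrLivArea": rng.lognormal(7.0, 0.4, 500),
    "TotalBsmtSF": rng.lognormal(6.8, 0.5, 500),
})

# Univariate outlier check: flag cases more than 3 std from the mean.
z = (df["GrLivArea"] - df["GrLivArea"].mean()) / df["GrLivArea"].std()
outliers = df[z.abs() > 3]

# Skewness before and after a log1p transform.
for col in ["GrLivArea", "TotalBsmtSF"]:
    print(f"{col}: skew {skew(df[col]):.2f} -> "
          f"{skew(np.log1p(df[col])):.2f}")
```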
After preprocessing the data, we implemented the following regression models:
- Linear Regression
- Ridge Regression
- Random Forest Regressor (with different depth levels)

Each model was evaluated using:
- R-squared (R2)
- Root Mean Square Error (RMSE)
- Best Score
- Cross-Validation Score
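Fitting the models and computing these metrics can be sketched as below. The feature matrix is synthetic, the Ridge `alpha` and the forest's `max_depth` are placeholder assumptions, and "Best Score" in the project presumably comes from a hyper-parameter search (e.g. `GridSearchCV.best_score_`), which is omitted here:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the processed feature matrix and target.
rng = np.random.default_rng(3)
X = rng.random((500, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 2.0]) + rng.normal(0, 0.1, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (Ridge(alpha=1.0),
              RandomForestRegressor(max_depth=5, random_state=0)):
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    r2 = r2_score(y_te, pred)                      # R-squared
    rmse = mean_squared_error(y_te, pred) ** 0.5   # RMSE
    cv = cross_val_score(model, X, y, cv=5).mean() # cross-validation
    print(type(model).__name__, round(r2, 3), round(rmse, 3), round(cv, 3))
```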
On the basis of the evaluation parameters calculated for each model, the observations are:
- R-squared is a statistical measure of how close the data are to the fitted regression line; the higher the R-squared, the better the model fits the data --> Ridge Regression (0.9285)
- Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors); the lower the RMSE, the better the model fits the data --> Ridge Regression (0.1021)
- The higher the Best Score, the better the model fits the data --> Ridge Regression (0.8857)
- A higher Cross-Validation Score means the model performs well on the validation set, indicating that it may also perform well on unseen data (the test set) --> Ridge Regression (0.8927)