The aim of this project is to identify a suitable model for predicting house prices from a set of significant predictor variables, using a supervised learning technique.
The data-set is available on Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
- The train data-set has 1460 samples, 80 features and 1 target variable.
- The test data-set has 1459 samples and 80 features.
- The target variable is the sale price.
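For reference, a minimal sketch of loading and checking the data with pandas; the local file names 'train.csv' and 'test.csv' are assumed, matching the usual Kaggle download:

```python
import pandas as pd

# File names are an assumption; the Kaggle CSVs are typically named this way.
train = pd.read_csv('train.csv')   # 1460 samples, features + 'SalePrice'
test = pd.read_csv('test.csv')     # 1459 samples, features only

print(train.shape, test.shape)
print(train['SalePrice'].describe())
```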
The heat-map shows the correlation of each variable with every other variable. A darker colour means the two variables are strongly correlated, while a lighter colour means they have almost no relationship.
The heat-map above shows that the provided data suffers from multicollinearity, for example 'TotRmsAbvGrd' with 'GrLivArea', 'TotalBsmtSF' with '1stFlrSF', 'GarageCars' with 'GarageArea', and more: one independent variable has a strong relationship with another independent variable, which affects the model's performance.
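A sketch of how such a heat-map and the pairwise correlations can be produced, assuming the training data is loaded in a DataFrame named `train`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = train.select_dtypes(include='number').corr()  # numeric features only

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='viridis', square=True)
plt.title('Correlation heat-map')
plt.show()

# The pairs called out above are easy to confirm numerically:
print(corr.loc['TotRmsAbvGrd', 'GrLivArea'])
print(corr.loc['TotalBsmtSF', '1stFlrSF'])
print(corr.loc['GarageCars', 'GarageArea'])
```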
Plotting 'SalePrice' on the y-axis against 'GrLivArea', 'TotalBsmtSF' and '1stFlrSF' on the x-axis, these features appear to have outliers. Since 'TotalBsmtSF' and '1stFlrSF' are strongly correlated, only 'TotalBsmtSF' has been taken for further analysis, along with 'GrLivArea'.
From the two images above, we can clearly see that several data points (red circles) lie an abnormal distance from the other values in the sample. This can cause heteroscedasticity, where the residuals form a cone-like pattern and the standard errors become biased. The cone-like pattern (green lines) is shown below:
There are many techniques for dealing with outliers, but here the outliers mentioned above have simply been removed to keep the process simple.
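A sketch of this simple removal step, assuming `train` holds the training data; the cut-off values below are illustrative assumptions, not the exact thresholds used here:

```python
import matplotlib.pyplot as plt

# Scatter plots of the target against the two retained features.
train.plot.scatter(x='GrLivArea', y='SalePrice')
train.plot.scatter(x='TotalBsmtSF', y='SalePrice')
plt.show()

# Drop the points that sit far away from the rest of the sample.
train = train[~((train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000))]
train = train[train['TotalBsmtSF'] < 6000]
```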
Weak features whose correlation coefficient with 'SalePrice' is less than 0.2, i.e. those with almost no relationship to the target, have been removed.
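A sketch of this filtering step, assuming `train` holds the training data and that the 0.2 threshold is applied to the absolute correlation with 'SalePrice':

```python
# Correlation of every numeric feature with the target.
corr_with_target = train.select_dtypes(include='number').corr()['SalePrice']

# Keep only features whose absolute correlation with 'SalePrice' is >= 0.2.
weak_features = corr_with_target[corr_with_target.abs() < 0.2].index
train = train.drop(columns=list(weak_features))
```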
Removing every feature with missing data is not good practice, because some of those features might be important. However, filling in the missing data is tedious work that depends on domain knowledge and experience. To keep the process simple, all features with missing data have been removed except 'Electrical', which has only one missing value; that value is filled in with the most common value of 'Electrical'.
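A sketch of this missing-data handling, assuming `train` holds the training data:

```python
# Count missing values per column.
missing = train.isnull().sum()
to_drop = missing[missing > 0].index.drop('Electrical')

# Drop every feature with missing data except 'Electrical' ...
train = train.drop(columns=list(to_drop))

# ... and fill its single missing entry with the most common value.
train['Electrical'] = train['Electrical'].fillna(train['Electrical'].mode()[0])
```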
The normality graphs for 'SalePrice', 'GrLivArea' and 'TotalBsmtSF' (click on the pictures to zoom in).
This step transforms the data towards a normal distribution using the log transformation. A probability plot is applied: if the data points lie on the diagonal line, the feature is more likely to be normally distributed. The skewness and kurtosis also indicate whether a feature is normal, left-skewed or right-skewed, and whether it has heavier or lighter tails (a sketch of this step follows the rule-of-thumb list below).
According to the rule of thumb:
- Reference: https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics
- Skewness
- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical
- If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed
- If the skewness is less than -1 or greater than 1, the data are highly skewed
- Kurtosis
- If the kurtosis is close to 0, then a normal distribution is often assumed. These are called mesokurtic distributions.
- If the kurtosis is less than zero, then the distribution has lighter tails and is called a platykurtic distribution.
- If the kurtosis is greater than zero, then the distribution has heavier tails and is called a leptokurtic distribution.
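A sketch of the normality check and log transformation described above, assuming `train` holds the cleaned training data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

for col in ['SalePrice', 'GrLivArea', 'TotalBsmtSF']:
    print(col, 'skewness:', train[col].skew(), 'kurtosis:', train[col].kurt())
    stats.probplot(train[col], plot=plt)   # points on the diagonal ~ normal
    plt.title(col)
    plt.show()

# Right-skewed features are pulled towards normality with a log transform.
train['SalePrice'] = np.log(train['SalePrice'])
train['GrLivArea'] = np.log(train['GrLivArea'])
# 'TotalBsmtSF' contains zeros, so log1p (log(1 + x)) is the safer choice.
train['TotalBsmtSF'] = np.log1p(train['TotalBsmtSF'])
```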
- The ordinal variables have been label encoded by converting each string category into an ordered number.
- All year-type variables have been converted into year intervals. For instance: original = 1995 -> year interval = 2018 - 1995 = 23
- Lastly, the remaining categorical variables have been transformed into dummy values (a combined sketch of these steps is shown after this list).
- The test data-set goes through the same process as the training data-set (e.g. data transformation, removal of weak features, etc.).
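A combined sketch of these preparation steps, assuming `train` holds the cleaned training data; the 'ExterQual' mapping and the choice of year columns are illustrative assumptions:

```python
import pandas as pd

# 1. Label-encode an ordinal variable as ordered integers
#    (the mapping for 'ExterQual' is just one example).
quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
train['ExterQual'] = train['ExterQual'].map(quality_map)

# 2. Convert year-type variables into intervals relative to 2018.
for col in ['YearBuilt', 'YearRemodAdd', 'YrSold']:
    train[col] = 2018 - train[col]

# 3. One-hot encode the remaining categorical variables into dummy values.
train = pd.get_dummies(train)
```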
The XGBoost model will be used in this project. XGBoost stands for eXtreme Gradient Boosting. XGBoost is fast and dominates structured or tabular datasets on classification and regression predictive modeling problems.
- A good explanation is available here: https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
The XGBoost hyper-parameters have been tuned with scikit-learn's 'RandomizedSearchCV', which samples parameter settings at random to find good hyper-parameters with less execution time. After fitting the training data to the XGBoost model, the generated results are shown below:
- Note: 'Test' shown in the results is actually validation data, and 'r2' is R-squared.
The results show that the 'Test r2' is slightly lower than the 'Train r2', which means the model is slightly over-fitting.
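A hedged sketch of the tuning and evaluation step, assuming the prepared training data is in `train`; the parameter grid, split size and random seeds are illustrative assumptions, not the exact settings used here:

```python
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

X = train.drop(columns=['SalePrice'])
y = train['SalePrice']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

# Randomly sample hyper-parameter combinations instead of a full grid search.
param_distributions = {
    'n_estimators': [300, 500, 1000],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 1.0],
}
search = RandomizedSearchCV(XGBRegressor(), param_distributions,
                            n_iter=20, cv=5, random_state=42)
search.fit(X_train, y_train)

best = search.best_estimator_
print('Train r2:', r2_score(y_train, best.predict(X_train)))
print('Test r2 :', r2_score(y_val, best.predict(X_val)))  # 'Test' = validation split
```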
A visual check of whether the standardized residuals follow a normal distribution: the standardized residuals do not appear close to normally distributed.
- Reference: http://docs.statwing.com/interpreting-residual-plots-to-improve-your-regression/#y-unbalanced-header
The data points are not evenly distributed vertically, so the model has room for improvement.
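A sketch of this residual check, reusing `best`, `X_val` and `y_val` from the tuning sketch above; an even vertical spread around zero is what a well-behaved model would show:

```python
import matplotlib.pyplot as plt

predictions = best.predict(X_val)
residuals = y_val - predictions
standardized = (residuals - residuals.mean()) / residuals.std()

plt.scatter(predictions, standardized, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('Predicted SalePrice (log scale)')
plt.ylabel('Standardized residual')
plt.show()
```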
After comparing the results, I still need to put more effort into improving the model. The results suggest that some valuable information in the dataset has not yet been discovered. I probably need to review all the missing data and outliers, and spend more time on data analysis and the multicollinearity issue.
Google Colab
- Python 3
- xgboost 0.7.post4
- sklearn 0.19.2