Real estate developers find it difficult to assess how much individual metrics and attributes influence house prices in the KC housing dataset. Their primary concern is how these factors interact to affect pricing outcomes. This lack of clarity can lead to both overpricing and underpricing of properties. To address it, we aim to develop a more comprehensive understanding of the dataset's variables, enabling developers to make more accurate pricing decisions based on a combination of factors.
The project focuses on building a machine-learning model for house-price forecasting that serves investors, owners, and buyers.
House price forecasting is a crucial task in the real estate industry. Accurate predictions assist homebuyers, sellers, and investors in making informed decisions regarding property transactions.
- Jupyter Notebook: The Jupyter Notebook is our key deliverable and contains details of our approach and methodology, data cleaning, exploratory data analysis, and model building and validation. We recommend using nbviewer to view the Jupyter Notebook.
- Presentation: The presentation gives a high-level overview of our approach, findings and recommendations for non-technical stakeholders. It is intended to be between 5 and 10 minutes long.
- Data: The dataset can be found in the file "kc_house_data.csv" in the Data folder of this repository. It was originally provided in the following repository.
- Python version: 3.6.9
- Matplotlib version: 3.1.3
- Seaborn version: 0.9.0
- Pandas version: 0.25.1
- Numpy version: 1.16.5
- Statsmodels version: 0.10.1
- Scikit-learn version: 0.21.2
- Clone this repository - guidance.
- Dataset can be found in the file "kc_house_data.csv".
- Check requirements in Technologies section above and download libraries if necessary.
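One way to check the requirements is to print the versions that are currently installed and compare them with the list above. The snippet below is a minimal sketch: the helper name `installed_versions` is hypothetical, and it assumes each library exposes a `__version__` attribute under the import names shown (note that Scikit-learn is imported as `sklearn`).

```python
import importlib


def installed_versions(module_names):
    """Return {module name: version string}, or None if the module is missing."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Library is not installed in this environment.
            versions[name] = None
        else:
            # Fall back to "unknown" if the module has no __version__ attribute.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Import names for the libraries listed in the Technologies section.
REQUIRED = ["matplotlib", "seaborn", "pandas", "numpy", "statsmodels", "sklearn"]

if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

Any library reported as not installed can then be installed with your usual package manager.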
Here we clean the data: handling missing values and duplicates, transforming and reshaping the data, and applying the other steps needed to get a clean, structured format suitable for analysis and modeling.
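The cleaning steps described here can be sketched with pandas as shown below. This is an illustration under stated assumptions, not the notebook's actual code: the small frame is synthetic, the `"?"` placeholder and the fill strategies (0 for `waterfront`, the median for `sqft_basement`) are choices made for the example, and the function name `clean` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic rows standing in for kc_house_data.csv (all values are made up).
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "date": ["10/13/2014", "12/9/2014", "12/9/2014", "2/25/2015", "2/18/2015"],
    "price": [221900.0, 538000.0, 538000.0, 180000.0, 510000.0],
    "waterfront": [np.nan, 0.0, 0.0, 1.0, np.nan],
    "sqft_basement": ["0.0", "400.0", "400.0", "?", "0.0"],
})


def clean(df):
    """Return a cleaned copy: no duplicate rows, typed columns, no missing values."""
    out = df.drop_duplicates().copy()
    # Parse the date strings into real datetimes.
    out["date"] = pd.to_datetime(out["date"], format="%m/%d/%Y")
    # Assumption for this sketch: a missing waterfront flag means "not waterfront".
    out["waterfront"] = out["waterfront"].fillna(0)
    # Turn non-numeric placeholders into NaN, then fill with the column median.
    out["sqft_basement"] = pd.to_numeric(out["sqft_basement"], errors="coerce")
    out["sqft_basement"] = out["sqft_basement"].fillna(out["sqft_basement"].median())
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned.dtypes)
```

Working on a copy keeps the raw data intact, so every cleaning decision can be revisited later.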
Here we explore the different features of the dataset to gain a better understanding of the data. We use data visualization to uncover trends and patterns, feature engineering to create new features from existing ones, and one-hot encoding on the categorical variables that we need for analysis.
Most houses are priced between half a million and one million dollars, while the most expensive houses sell for two million dollars or more.
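As an illustration of the feature engineering and one-hot encoding mentioned above, the sketch below derives one new feature and encodes one categorical column with pandas. The data is synthetic, and the column names, the `renovated` flag, and the choice of `zipcode` as the encoded column are assumptions made for this example rather than the notebook's exact steps.

```python
import pandas as pd

# Synthetic rows; column names are assumed for illustration.
houses = pd.DataFrame({
    "price": [300000, 650000, 1200000, 450000],
    "yr_renovated": [0, 2003, 0, 0],
    "waterfront": [0, 0, 1, 0],
    "zipcode": [98178, 98125, 98040, 98178],
})

# Feature engineering: flag houses that have ever been renovated.
houses["renovated"] = (houses["yr_renovated"] > 0).astype(int)

# One-hot encoding: one 0/1 column per zipcode. drop_first=True removes the
# first category so the dummy columns are not perfectly collinear with the
# intercept in a linear regression.
encoded = pd.get_dummies(
    houses, columns=["zipcode"], prefix="zip", drop_first=True, dtype=int
)

print(encoded.columns.tolist())
```

Dropping one category per encoded variable is a common choice for linear models; the dropped category becomes the reference level against which the other coefficients are read.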
Overview of house features
- Categorical features of the house include `id`, `date`, `bedrooms`, `floors`, `waterfront`, `view`, `grade`, `year_built`, `yr_renovated`, `zipcode`, `lat`, `long`.
- Numerical variables include `price`, `sqft_living`, `sqft_lot`, `sqft_above`, `sqft_basement`, `sqft_living_above`, `sqft_lot_below`.
- As `bedrooms` increases, so does the house's selling price.
- Houses with more `floors`, up to 2.5, tend to have a higher price.
Here we have the outcome of an Ordinary Least Squares (OLS) linear regression analysis performed on a dataset with 'price' as the dependent variable and 'sqft_living' as the independent variable.
- R² is approximately 0.495
- F-statistic: the high value (1.868e+04) suggests that the model is statistically significant.
- Intercept: the estimated value of 'price' when 'sqft_living' is 0, approximately -47,430.
- Coefficient for 'sqft_living': approximately 283.13, the estimated change in 'price' for each additional unit of 'sqft_living'.
- RMSE is approximately 200,639
- R-squared (R²) is approximately 0.69
- The high F-statistic value (6096) suggests that the model is statistically significant.
- RMSE is approximately 127,473, lower than the 200,639 reported for the regression on 'sqft_living' alone, which indicates smaller prediction errors.
- R² is approximately 0.87
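To make the reported metrics concrete, the sketch below fits a one-variable least squares regression of price on `sqft_living` and computes R² and RMSE by hand. It uses synthetic data and NumPy's least squares solver rather than the statsmodels summary described above, so the numbers it produces are not the project's results; the generating slope, intercept, and noise level are arbitrary assumptions.

```python
import numpy as np

# Synthetic data: price depends linearly on living area plus noise.
rng = np.random.RandomState(0)
sqft_living = rng.uniform(500, 4000, size=200)
price = -40000 + 280 * sqft_living + rng.normal(0, 50000, size=200)

# Design matrix with a column of ones so the fit includes an intercept.
X = np.column_stack([np.ones_like(sqft_living), sqft_living])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

predicted = X @ coef
residuals = price - predicted

# R²: share of the variance in price explained by the model.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# RMSE: typical size of a prediction error, in the same units as price.
rmse = np.sqrt(np.mean(residuals ** 2))

print(f"intercept={intercept:.0f}, slope={slope:.2f}")
print(f"R^2={r_squared:.3f}, RMSE={rmse:.0f}")
```

Because RMSE is expressed in dollars, it is the easier of the two metrics to compare across models fitted to the same data, while R² is unitless.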
In conclusion, our predictive model accounts for approximately 87% of the variance in house prices, which indicates strong predictive power. The factors considered include square footage, location, view and waterfront, all of which have a substantial impact on property values. Real estate is influenced by many more dynamic variables, however, so achieving 100% accuracy in predicting house prices is not realistic. Our model's performance is encouraging and can help both buyers and sellers estimate property values in King County. Even so, the model should be used in conjunction with other information for more precise pricing decisions.
Name | GitHub |
---|---|
Priscillah Wairimu | https://github.com/Wairimukimm |
Lewis Kamindu | https://github.com/lewigi |
Brian Chacha | https://github.com/MarwaBrian |
Meshael Oduor | https://github.com/Ayangaoduor1 |
Lucy Waruguru | https://github.com/WacekeW |
Stephen Butiya | https://github.com/obystephen |