- In this project, I build a web app that predicts house prices in Ho Chi Minh City from data scraped from the Propzy website.
- App: https://hcmhouseprice.herokuapp.com/
├───assets (containing files for web layout design)
│ style.css
├───data chunk (containing separate data for each district)
│
├───data (containing data for processing)
│
├── app.py
├── crawl_data.ipynb
├── eda_cleaning.ipynb
├── feature_engineering_selection.ipynb
├── model.ipynb
├── final_model.sav
├── Procfile
└── requirements.txt
- For scraping, I use BeautifulSoup to collect data from the Propzy website.
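
As a rough illustration of the scraping step, the sketch below parses an inline HTML snippet with BeautifulSoup. The markup, class names, and field names are hypothetical; the real Propzy page structure is not described here, and `crawl_data.ipynb` may work differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a listing page; not Propzy's real HTML.
html = """
<div class="listing">
  <h2 class="title">Nhà phố Quận 7</h2>
  <span class="price">5.2 tỷ</span>
</div>
<div class="listing">
  <h2 class="title">Căn hộ Quận 2</h2>
  <span class="price">3.1 tỷ</span>
</div>
"""

# Parse with the parser that ships with Python, so no extra backend is needed.
soup = BeautifulSoup(html, "html.parser")

records = []
for card in soup.find_all("div", class_="listing"):
    # Each card yields one raw record; cleaning happens in a later step.
    title = card.find("h2", class_="title").get_text(strip=True)
    price = card.find("span", class_="price").get_text(strip=True)
    records.append({"title": title, "price": price})
```

In a real crawl the HTML would come from an HTTP response rather than a string, and the selectors would have to match the site's actual markup.
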
- Drop duplicated values
- Extract and create new information from the text description of each house
- Correct wrong prices and numeric values of observations
- Correct missing values
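
Extracting information from a listing's text description could look something like the sketch below, which uses regular expressions. The sample description, the keywords, and the helper names `extract_count` and `extract_area_m2` are assumptions for illustration, not the patterns used in the notebooks.

```python
import re
from typing import Optional


def extract_count(description: str, keyword: str) -> Optional[int]:
    """Return the integer directly preceding `keyword`, or None if absent."""
    pattern = r"(\d+)\s*" + re.escape(keyword)
    match = re.search(pattern, description, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def extract_area_m2(description: str) -> Optional[float]:
    """Return the first area written like '56m2' or '56,5 m2', or None."""
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*m2", description, flags=re.IGNORECASE)
    if match is None:
        return None
    # Accept a comma as the decimal separator.
    return float(match.group(1).replace(",", "."))


# Hypothetical description text.
description = "Nhà 3 phòng ngủ, 2 WC, diện tích 56,5 m2"
bedrooms = extract_count(description, "phòng ngủ")
bathrooms = extract_count(description, "WC")
area = extract_area_m2(description)
```

Returning `None` when a pattern is missing keeps the gap visible, so it can be handled together with the other missing values.
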
- Examine missing values
- Analyze numerical variables and their distribution
- Analyze categorical variables and their cardinality
- Detect outliers
- Analyze the relationship between each house feature and the house price
- Remove outliers
- Complete missing values
- Transform numerical variables because of their skewed distributions
- Encode categorical variables for model building
- Create a new feature from the heading title
- Oversample the data
- Apply clustering and PCA
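
Two of the feature engineering steps above, removing outliers and transforming skewed variables, can be sketched with the standard library alone. The outlier rule and the transform are not named in this README, so the 1.5 × IQR rule and `log1p` below are assumptions, and the prices are made-up values.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) bounds at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices with one obvious outlier.
prices = [1.2, 1.5, 1.8, 2.0, 2.3, 2.5, 3.0, 95.0]

low, high = iqr_bounds(prices)
kept = [p for p in prices if low <= p <= high]

# log1p compresses the long right tail; expm1 inverts it for reporting.
log_prices = [math.log1p(p) for p in kept]
restored = [math.expm1(v) for v in log_prices]
```

If the model is trained on a log-transformed price, its predictions need the inverse transform before they are shown in the app.
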
- Drop redundant features
- Remove highly correlated features
- Examine feature importance
- Remove anomalous observations
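
Removing highly correlated features might be done along the lines of this minimal sketch: walk through the features in order and keep one only if its absolute Pearson correlation with every feature already kept stays below a threshold. The 0.9 threshold, the column names, and the values are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Return the feature names kept after a greedy correlation filter."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: area_sqft is just area_m2 in different units.
features = {
    "area_m2": [40.0, 55.0, 60.0, 72.0, 90.0, 120.0],
    "area_sqft": [430.6, 592.0, 645.8, 775.0, 968.8, 1291.7],
    "bedrooms": [2, 1, 3, 2, 4, 3],
}
selected = drop_correlated(features)
```

The result depends on feature order, because the first feature of a correlated pair is the one that survives.
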
- Perform K-fold cross validation
- Train Random Forest, XGBoost, and LightGBM models on the training data
- Run RandomizedSearchCV to tune hyperparameters and optimize the score
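
The modeling steps above can be sketched with scikit-learn as follows, combining K-fold cross-validation with `RandomizedSearchCV` on a Random Forest. The synthetic data, the search space, and the RMSE scoring are assumptions for illustration. They are not the settings used in `model.ipynb`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for the engineered features and the price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Shuffled 5-fold split with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Illustrative search space; these ranges are assumptions, not tuned values.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # sample 5 random combinations instead of the full grid
    cv=cv,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # scikit-learn reports the negated RMSE
```

The same search object should work with the scikit-learn compatible XGBoost and LightGBM regressors, each with its own parameter ranges.
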
- This project aims to help people estimate a selling price for their real estate and judge whether a house they intend to buy is reasonably priced. Above all, though, the main purpose of this project is to have fun with machine learning.