Sample Data Preprocessing, Cleaning and Feature Engineering
- Objective: Predict the sale price of each house. For each Id in the test set, predict the value of the SalePrice variable.
- Libraries: NumPy, pandas, matplotlib.pyplot, seaborn
- train.csv - the training set
- test.csv - the test set
SOLUTION:
- Training and testing dataset merging
- Check all data columns and their details
- Set the Id column as the index
- Get the percentage of null values per column
- Drop columns: if a column's null-value percentage is 20% or more, it is dropped
- Find the unique-value count and the unique values for each column
- Modified correlation heatmap for features highly correlated with SalePrice
- Check for null values
- Check the basement (Bsmt) and garage features and their missing values
- Check for null values
- Fill in with NaN values
- Check shape of data
- Create a bucket using range
- Replace NaN values of BsmtFinType2 with the mode
- Handle missing values of the remaining features
- Check the unique and null values of the other features
- Use fillna to replace null values
- Handle missing values of the LotFrontage feature
- Convert columns that are categorical in nature but stored as int64 to str
- Convert a time-related feature to month abbreviations
- Create a list of the modified columns
- Define Data Categories
- Concatenate the category codes column-wise
- Keep low-cardinality columns (columns with few distinct values)
- Convert object features to numeric using dummy variables
- Select and drop the dummy variables
- Check the shape of the modified data
- Scale the dataset with RobustScaler
- Check the length of the training dataset
- Apply k-fold cross-validation to evaluate each model on several train/validation splits of the data
- Linear Regression
- Support Vector Machine
- Decision Tree Regressor
- Random Forest Regressor
- Bagging & boosting
- XGBoost
- RandomizedSearchCV, GridSearchCV for SVM model
- XGBRegressor for XGBoost
- Correlation bar plot for feature engineering
- Drop features that are not required:
  'YrSold', 'LowQualFinSF', 'MiscVal', 'BsmtHalfBath', 'BsmtFinSF2', '3SsnPorch', 'MoSold'
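The core preprocessing steps listed above (merging, null percentages, the 20% drop rule, filling, dummy variables, robust scaling, k-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code. It runs on a tiny synthetic frame, so `make_frame`, the `MostlyMissing` column, and the generated values are hypothetical. Filling numeric columns with the median is an assumption (the outline only names the mode for BsmtFinType2). Keeping SalePrice out of the drop rule is also an assumption: after merging, the test rows have no SalePrice, so the column would otherwise exceed the threshold.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler


def make_frame(n, start_id, with_target, rng):
    """Build a small synthetic stand-in for train.csv / test.csv (hypothetical data)."""
    lot = rng.normal(70, 10, n)
    mostly = np.full(n, np.nan)
    mostly[0] = 1.0  # only one non-null value, so the column is ~95% null
    frame = pd.DataFrame({
        "Id": np.arange(start_id, start_id + n),
        "LotFrontage": lot,
        "BsmtFinType2": rng.choice(["Unf", "Rec", "LwQ"], n),
        "MostlyMissing": mostly,
    })
    if with_target:
        frame["SalePrice"] = 1500 * lot + rng.normal(0, 5000, n)
    frame.loc[frame.index[::10], "LotFrontage"] = np.nan  # every 10th value missing
    frame.loc[frame.index[5], "BsmtFinType2"] = np.nan
    return frame


def null_percentages(df):
    """Percentage of missing values in each column."""
    return df.isnull().mean() * 100


def drop_sparse_columns(df, threshold=20.0, keep=("SalePrice",)):
    """Drop columns with a null percentage >= threshold, except those in `keep`."""
    pct = null_percentages(df)
    to_drop = [c for c in df.columns if pct[c] >= threshold and c not in keep]
    return df.drop(columns=to_drop)


def fill_missing(df, skip=("SalePrice",)):
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if col in skip or not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


rng = np.random.default_rng(0)
train = make_frame(30, 1, True, rng)
test = make_frame(10, 31, False, rng)

# Merge train and test so both receive identical preprocessing.
merged = pd.concat([train, test], axis=0).set_index("Id")

clean = fill_missing(drop_sparse_columns(merged))

# One-hot encode object columns; drop_first avoids a redundant dummy per feature.
encoded = pd.get_dummies(clean, drop_first=True, dtype=float)

# Rows with a known SalePrice form the training set again.
train_rows = encoded[encoded["SalePrice"].notna()]
X = train_rows.drop(columns="SalePrice")
y = train_rows["SalePrice"]

# Scaling inside the pipeline keeps each validation fold out of the scaler fit.
model = make_pipeline(RobustScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.mean())
```

Swapping `LinearRegression()` for another regressor from the list above leaves the rest of the sketch unchanged, which makes it easy to compare models on the same folds.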
Run models again after Feature Engineering and Hyperparameter Tuning
- Use 'Pickle' to save the model
- Use the data 'model_house_price_prediction.csv'
- Support Vector Machine with Accuracy = 90%
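
As a rough illustration of the tuning and saving steps, the sketch below tunes an SVM regressor with RandomizedSearchCV and stores the best estimator with pickle. It is not the notebook's code: the synthetic data, the parameter grid, and the file name `svm_house_price.pkl` are all assumptions made for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the prepared training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

# make_pipeline names each step after its lowercased class, hence the "svr__" prefix.
pipeline = make_pipeline(RobustScaler(), SVR())
param_distributions = {
    "svr__C": [0.1, 1.0, 10.0, 100.0],
    "svr__gamma": ["scale", 0.01, 0.1],
    "svr__epsilon": [0.01, 0.1],
}

# Sample 5 of the 24 possible combinations and score each with 3-fold CV.
search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)
best_model = search.best_estimator_

# Save the tuned model to disk, then load it back.
model_path = os.path.join(tempfile.mkdtemp(), "svm_house_price.pkl")  # hypothetical name
with open(model_path, "wb") as fh:
    pickle.dump(best_model, fh)
with open(model_path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
predictions_match = np.allclose(best_model.predict(X), restored.predict(X))
print(search.best_params_, predictions_match)
```

A pickled scikit-learn model should be loaded with the same library version that saved it, and pickle files from untrusted sources should not be loaded at all.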