Clean your data
-
Check for formatting errors (e.g., a date in one row may be 5/7/2001 and in the next 10th May 2010)
-
Strings in numeric fields
-
Outliers: a row may hold an extreme (and unbelievable) value, for example in number_of_bedrooms
-
Missing values: a row may have no value in a column, for example a missing price
-
Misspellings: a row may have a misspelled value, for example in the type column
-
Duplicates
-
Nulls and NaN
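
The checks above can be sketched in plain Python. This is a minimal illustration, not a definitive pipeline: the rows, the day-before-month date assumption, the hand-written typo table, the bedroom cutoff of 20, and the median fill are all assumptions made for the example.

```python
import math
import re
from datetime import datetime
from statistics import median

# Hypothetical listing rows; every value is invented for illustration.
rows = [
    {"listed_on": "5/7/2001", "price": "250000", "number_of_bedrooms": 3, "type": "house"},
    {"listed_on": "10th May 2010", "price": "n/a", "number_of_bedrooms": 2, "type": "apartment"},
    {"listed_on": "1/3/2015", "price": "310000", "number_of_bedrooms": 250, "type": "huose"},
    {"listed_on": "1/3/2015", "price": "310000", "number_of_bedrooms": 250, "type": "huose"},
]

DATE_FORMATS = ("%d/%m/%Y", "%d %B %Y")  # assumption: day comes before month
KNOWN_TYPOS = {"huose": "house"}          # assumption: typos are listed by hand
MAX_BEDROOMS = 20                         # assumption: anything above is implausible


def parse_date(text):
    """Strip ordinal suffixes (st, nd, rd, th), then try each known format."""
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip(), flags=re.IGNORECASE)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None  # unparseable dates are treated as missing


def to_number(text):
    """Return a float, or None for non-numeric strings, nulls, and NaN."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def clean(rows):
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # duplicates: skip rows identical to an earlier one
            continue
        seen.add(key)
        bedrooms = row["number_of_bedrooms"]
        cleaned.append({
            "listed_on": parse_date(row["listed_on"]),        # formatting errors
            "price": to_number(row["price"]),                 # strings in numeric fields
            "number_of_bedrooms": bedrooms if bedrooms <= MAX_BEDROOMS else None,  # outliers
            "type": KNOWN_TYPOS.get(row["type"], row["type"]),  # misspellings
        })
    # Missing values: one possible strategy is to fill gaps with the median.
    known_prices = [r["price"] for r in cleaned if r["price"] is not None]
    for r in cleaned:
        if r["price"] is None:
            r["price"] = median(known_prices)
    return cleaned


cleaned_rows = clean(rows)
```

Whether an outlier or a missing value should be dropped, flagged, or filled depends on the dataset; the median fill here is only one option.
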
Create New Features From Existing Features
-
Binning
- Numeric Binning
- Categorical Binning
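
A minimal sketch of both kinds of binning in plain Python. The age edges, the bin labels, and the country-to-region table are illustrative assumptions, and `bin_numeric` and `bin_category` are hypothetical helper names.

```python
from bisect import bisect_right

# Numeric binning: map a continuous value onto a labelled range.
AGE_EDGES = [18, 35, 60]  # assumed boundaries between bins
AGE_LABELS = ["child", "young_adult", "adult", "senior"]


def bin_numeric(value, edges=AGE_EDGES, labels=AGE_LABELS):
    """Return the label of the bin containing value (each bin is [lower, upper))."""
    return labels[bisect_right(edges, value)]


# Categorical binning: merge fine-grained categories into broader groups.
COUNTRY_TO_REGION = {
    "Japan": "Asia", "India": "Asia",
    "France": "Europe", "Spain": "Europe",
}


def bin_category(value, mapping=COUNTRY_TO_REGION, default="Other"):
    """Return the broader group for value, or default for unseen categories."""
    return mapping.get(value, default)


age_bins = [bin_numeric(age) for age in [7, 18, 42, 60]]
region_bins = [bin_category(c) for c in ["India", "Spain", "Peru"]]
```

With these assumed edges, an age exactly on a boundary (such as 18) falls into the upper bin.
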
-
Splitting
- Date/Time Decomposition
- Compound String Splitting
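
Both kinds of splitting can be sketched in plain Python. The timestamp format, the sample address, and the helper names `decompose_timestamp` and `split_compound` are assumptions made for this example.

```python
from datetime import datetime


def decompose_timestamp(text, fmt="%Y-%m-%d %H:%M"):
    """Date/time decomposition: turn one timestamp string into several features."""
    moment = datetime.strptime(text, fmt)
    return {
        "year": moment.year,
        "month": moment.month,
        "day": moment.day,
        "hour": moment.hour,
        "weekday": moment.weekday(),          # Monday is 0, Sunday is 6
        "is_weekend": moment.weekday() >= 5,
    }


def split_compound(text, names, sep=","):
    """Compound string splitting: one delimited string becomes named fields."""
    parts = [part.strip() for part in text.split(sep)]
    if len(parts) != len(names):
        raise ValueError(f"expected {len(names)} parts, got {len(parts)}")
    return dict(zip(names, parts))


features = decompose_timestamp("2010-05-10 14:30")
address = split_compound("12 Main St, Springfield, IL", ["street", "city", "state"])
```

Each new field can then be used as its own feature, for example `is_weekend` or `city`.
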
-
One-Hot Encoding
- Assigning plain numeric codes to categories introduces a problem. For example, if regions are encoded as Asia = 1 and Europe = 2, the model may infer that Europe is greater than Asia. One-hot encoding avoids this by giving each category its own 0/1 column.
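
A minimal sketch of one-hot encoding in plain Python, where each region gets its own 0/1 column so no ordering between regions is implied. The helper name `one_hot_encode` and the sample regions are assumptions for illustration.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))  # fixed column order
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


regions = ["Asia", "Europe", "Asia", "Africa"]
categories, encoded = one_hot_encode(regions)
# categories -> ["Africa", "Asia", "Europe"]
# encoded    -> [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Every row contains exactly one 1, so the encoding records which region a row belongs to without ranking the regions.
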