https://github.com/ViduraPrasangana/pump-it-up.git
public_meeting
permit
management_group
extraction_type_group
extraction_type_class
installer
date_recorded
- no relationship with the target
construction_year
- removed and replaced by the number of working years (no_working_years)
wpt_name
- almost every value is unique
num_private
- most values are 0
subvillage
- too few records per category
region_code
- duplicate of region
recorded_by
- no relationship with the target
scheme_name
- each category has only a small number of records
payment
- payment and payment_type have almost the same classes
quality_group
- quality_group and water_quality have almost the same classes
quantity_group
- quantity_group and quantity have the same classes
source_type
- almost the same as source
source_class
- almost the same as source
waterpoint_type_group
- waterpoint_type_group and waterpoint_type have almost the same classes
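A minimal sketch of how the columns listed above could be dropped with pandas, assuming the list describes columns removed from the feature set. The helper name `drop_listed_columns` and the sample rows are hypothetical and only serve as illustration.

```python
import pandas as pd

# Column names copied from the list above.
# Note: date_recorded and construction_year should only be dropped
# after no_working_years has been derived from them.
DROP_COLUMNS = [
    "public_meeting", "permit", "management_group",
    "extraction_type_group", "extraction_type_class", "installer",
    "date_recorded", "construction_year", "wpt_name", "num_private",
    "subvillage", "region_code", "recorded_by", "scheme_name",
    "payment", "quality_group", "quantity_group", "source_type",
    "source_class", "waterpoint_type_group",
]


def drop_listed_columns(df, columns=DROP_COLUMNS):
    """Return a copy of df without the listed columns (hypothetical helper)."""
    # errors="ignore" skips names that are not present in df
    return df.drop(columns=columns, errors="ignore")


# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "region": ["Iringa", "Mara"],
    "region_code": [11, 20],
    "recorded_by": ["x", "x"],
    "quantity": ["enough", "dry"],
})
reduced = drop_listed_columns(sample)
```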
The columns below used 0 for missing values:
gps_height
population
amount_tsh
longitude
latitude
construction_year
These zeros were replaced in three steps:
- first, with the mean of the respective region and district_code
- remaining null values, with the mean of the region
- remaining null values, with the overall median
Only construction_year is replaced with the median, because the mean gives years with decimal points.
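The three-step fallback can be sketched with pandas `groupby(...).transform(...)`. This is an illustration under assumptions, not the repository's actual code: the helper name `impute_zeros`, the `use_median` flag, and the sample rows are hypothetical, and group statistics are computed from the observed (non-zero) values only.

```python
import numpy as np
import pandas as pd


def impute_zeros(df, column, use_median=False):
    """Fill zeros in `column` by district, then region, then overall median."""
    stat = "median" if use_median else "mean"  # median suits construction_year
    observed = df[column].replace(0, np.nan)   # treat 0 as missing

    # Step 1: statistic within the same region and district_code
    by_district = observed.groupby(
        [df["region"], df["district_code"]]
    ).transform(stat)
    # Step 2: statistic within the same region
    by_region = observed.groupby(df["region"]).transform(stat)

    # Step 3: overall median as the last fallback
    return (
        observed.fillna(by_district)
        .fillna(by_region)
        .fillna(observed.median())
    )


# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B", "C"],
    "district_code": [1, 1, 1, 2, 1, 3],
    "population": [100, 300, 0, 0, 50, 0],
})
df["population"] = impute_zeros(df, "population")
```

In this sample, the third row is filled from its district, the fourth from its region, and the last from the overall median because region C has no observed values.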
The categorical columns below had NaN for missing values:
funder
scheme_management
These were replaced with a new category, other.
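A minimal sketch of this replacement using pandas `fillna`. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "funder": ["Danida", np.nan, "Hesawa"],
    "scheme_management": [np.nan, "VWC", np.nan],
})

# Replace missing values with the new category "other"
for col in ["funder", "scheme_management"]:
    df[col] = df[col].fillna("other")
```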
The following columns had large values and were normalized to the range 0 to 20 using a Min-Max scaler:
amount_tsh
gps_height
population
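One way to get the 0 to 20 range is scikit-learn's `MinMaxScaler` with `feature_range=(0, 20)`, which maps each value x to 20 * (x - min) / (max - min) per column. The sketch below assumes that scaler was used; the sample values are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

num_cols = ["amount_tsh", "gps_height", "population"]

# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "amount_tsh": [0.0, 500.0, 6000.0],
    "gps_height": [100.0, 1390.0, 2100.0],
    "population": [1.0, 250.0, 1000.0],
})

# Scale each column independently so its minimum maps to 0 and its maximum to 20
scaler = MinMaxScaler(feature_range=(0, 20))
df[num_cols] = scaler.fit_transform(df[num_cols])
```

When a separate test set is involved, the scaler would normally be fitted on the training data and then applied to the test data with `scaler.transform`.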
- new column named no_working_years, created from date_recorded and construction_year
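The exact formula is not stated here. A plausible sketch, assuming the feature is the year of recording minus the construction year:

```python
import pandas as pd

# Made-up sample rows, for illustration only
df = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-02-04"],
    "construction_year": [1999, 2010],
})

# Assumption: working years = year the record was taken - construction year
recorded_year = pd.to_datetime(df["date_recorded"]).dt.year
df["no_working_years"] = recorded_year - df["construction_year"]
```

This should run after the zero values in construction_year have been imputed, otherwise the difference comes out as roughly 2000 years for those rows.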
- RandomForestClassifier (sklearn)
- XGBClassifier (xgboost)
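A minimal sketch of fitting and comparing both classifiers. The synthetic data from `make_classification` stands in for the preprocessed features, and the hyperparameters are illustrative assumptions, not the values used in this project. The xgboost import is guarded so the sketch still runs when the package is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and integer-encoded labels
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, random_state=0
    ),
}
try:
    from xgboost import XGBClassifier  # optional dependency

    models["XGBClassifier"] = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    pass  # xgboost not installed; compare only the random forest

# Fit each model and record its accuracy on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
```

Note that `XGBClassifier` expects class labels encoded as integers starting at 0, so string labels would first need something like scikit-learn's `LabelEncoder`.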