- Built several machine learning models (Logistic Regression, Random Forest, and XGBoost) to predict whether a person earns more than 50K a year.
- Applied several feature engineering methods, including filling in columns that have NA values.
- Python 3.8
- Packages: pandas, numpy, seaborn, matplotlib, sklearn, xgboost
- [XGBoost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html)
- [Data](https://archive.ics.uci.edu/ml/index.php)
- Normalize the numeric columns or apply a logistic transform to them (age, fnlwgt, education_number, hours)
- Group approximately equivalent categories together to reduce dimensionality (workclass, martial_status, native_country)
  - workclass: merge the 'Without-pay' and 'Never-worked' classes into a single 'Non-pay' class
  - martial_status: merge the 'Divorced' class into the 'Separated' class
  - native_country: group countries by continent
- One-hot encode the categorical columns
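
The transformation steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `engineer_features`, the `CONTINENT` lookup (deliberately incomplete), the fallback label `'Other'`, and the choice of z-score standardization are all assumptions, and the category labels should be adjusted to match how they are spelled in the data.

```python
import pandas as pd

# Hypothetical, incomplete continent lookup used only for illustration.
CONTINENT = {
    "United-States": "North-America",
    "Mexico": "North-America",
    "India": "Asia",
    "Germany": "Europe",
}

NUMERIC = ["age", "fnlwgt", "education_number", "hours"]
CATEGORICAL = ["workclass", "martial_status", "native_country"]


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Scale numeric columns, merge similar categories, then one-hot encode."""
    out = df.copy()

    # Standardize each numeric column to zero mean and unit variance
    # (one possible reading of "normalize").
    out[NUMERIC] = (out[NUMERIC] - out[NUMERIC].mean()) / out[NUMERIC].std()

    # Merge near-equivalent categories to reduce the number of levels.
    out["workclass"] = out["workclass"].replace(
        {"Without-pay": "Non-pay", "Never-worked": "Non-pay"}
    )
    out["martial_status"] = out["martial_status"].replace({"Divorced": "Separated"})

    # Map each country to its continent; nulls stay null so that they can
    # still be imputed later, and unknown countries fall back to "Other".
    out["native_country"] = out["native_country"].map(
        lambda c: CONTINENT.get(c, "Other"), na_action="ignore"
    )

    # One-hot encode the categorical columns (null rows get all-zero dummies).
    return pd.get_dummies(out, columns=CATEGORICAL)
```

Keeping nulls as nulls during the grouping step is a design choice made here so that a later imputation step still has something to fill.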
- Built an XGBoost model to predict the null values of workclass, occupation, and native_country
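
One way to do model-based imputation of this kind is sketched below. The helper `impute_with_classifier` and its parameters are hypothetical names, and it assumes the feature columns are already numeric, contain no nulls that the chosen model cannot handle, and that the DataFrame index is unique. It accepts any estimator with scikit-learn style `fit` and `predict` methods, so an `xgboost.XGBClassifier` can be passed in.

```python
import numpy as np
import pandas as pd


def impute_with_classifier(df, target, features, model):
    """Fill nulls in `target` with predictions from `model`.

    The model is trained on the rows where `target` is known and then
    predicts the rows where it is null. Returns a new DataFrame.
    """
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        return df.copy()

    # Encode the class labels as integers 0..k-1, since some classifiers
    # (recent XGBoost versions among them) expect integer-coded labels.
    codes, classes = pd.factorize(known[target])
    model.fit(known[features], codes)

    # Predict integer codes for the null rows and map them back to labels.
    predicted = np.asarray(model.predict(missing[features])).astype(int)
    out = df.copy()
    out.loc[missing.index, target] = classes[predicted].to_numpy()
    return out


# Example call (assumes xgboost is installed and `feature_cols` is a
# hypothetical list of numeric column names):
#   from xgboost import XGBClassifier
#   df = impute_with_classifier(df, "workclass", feature_cols, XGBClassifier())
```

The same helper could be called once per column (workclass, occupation, native_country), each time with a freshly constructed model.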
- Logistic Regression model
- Random Forest model
- XGBoost model
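
As a rough sketch of how the three models might be trained and compared, the snippet below fits each one on the same train/test split and reports held-out accuracy. The function name `compare_models`, the hyperparameters, the 80/20 split, and the use of accuracy as the metric are assumptions for illustration, not settings taken from this project. The xgboost import is treated as optional so the sketch still runs without it.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

try:  # xgboost is optional in this sketch
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None


def compare_models(X, y, seed=0):
    """Return held-out accuracy per model; X must be numeric, y coded as 0/1."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    models = {
        "logistic": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    if XGBClassifier is not None:
        models["xgboost"] = XGBClassifier(n_estimators=100, random_state=seed)

    # Fit each model on the training split and score it on the held-out split.
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }
```

Here `X` would be the one-hot encoded feature matrix and `y` a 0/1 indicator of whether income exceeds 50K.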