A Data-Driven Analysis of Global Income Distribution: Modeling and Analysis

Income disparity is a significant issue that affects the global population at varying levels. Efforts have been put together to curb it and improve the socio-economic fabric of our societies. This project aims to predict the income bracket of individuals based on a variety of features, and presents a holistic comparative analysis between multiple machine learning algorithms through hyperparameter optimization on a binary classification problem. Using machine learning, the model attempts to predict whether (Y/N) the income of a certain individual, with certain attributes (= features), exceeds $ 50,000 per annum. Three supervised, non-paramteric algorithms have been employed for evaluation i.e., K-nearest Neighbor, Support Vector Machine, & Random Forest.

Dataset

The Adult Data Set available at the UCI Machine Learning Repository is worked with to obtain results. The model is trained with 80% of the dataset and validated on the remaining 20%.

The data set is decribed to have the following characteristics:

48842 instances
8 categorical attributes and 6 continous
3620 instances with missing values
Target variable : income (>50K, <=50K)

The feature set is as follows:

Feature Selection and Engineering

The correlation matrix for the continuous features compared with the target variable is shown below:

I have dropped the categorical feature ‘education’ from our dataset, since it being the same as 'education-num', with the latter imposing ordinality. Features 'capital-gain' and 'capital-loss' are highly skewed and as such to minimize skewness, I have taken the square root for all instances of these features.

Data Preprocessing

I have employed the following steps to transform the dataset into a more representative form:

Missing Data Imputation
Label Encoding
One-Hot Encoding
Feature Scaling

Model Building & Training

There are three machine learning algorithms employed for this project:

K-Nearest Neighbor
Support Vector Machine
Random Forest

I have employed stratified k-fold cross validation, which is a variation of the k-fold cross validation technique that ensures each fold has approximately the same percentage of target class samples, thus addressing the dataset imbalance to an extent. In addition, it addresses the key issue of overfitting and promotes model generalization. Furthermore, the performance of a model significantly depends on the values of the model hyperparamters. I have employed the use of GridSearchCV to search all possible combinations of hyperparamter values, to determine optimal values for each of the three models.

Model Evalution

Each model has been assesed based on these evaluation metrics:

Accuracy
Confusion Matrix
Reciever Characteristics Curve (ROC)

A comparison of predictive accuracy obtained with those in literature is represented in the table below: