adult_dataset_analysis

This repository contains the exploratory data analysis (EDA), data preprocessing, and machine learning model training and evaluation for the Adult dataset.

Primary language: Jupyter Notebook

Adult Dataset Data Analysis

Data analysis of the Adult dataset.

EDA

Univariate Analysis

Histogram:

histogram


Box plots:

boxplot


Barplot of categorical features:

barplot

Bivariate Analysis

Pairplot:

pairplot

Barplot for numerical vs categorical features:

barplot
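
A bar plot of a numerical feature against a categorical one usually shows one aggregate value per category. The sketch below computes those values with pandas on a toy frame. It assumes the bars represent the per-category mean (seaborn's default for `barplot`), which the original notebook may or may not use; the data values are made up for illustration.

```python
import pandas as pd

# Toy data with hypothetical values; column names mimic the adult dataset.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "hours-per-week": [40, 35, 50, 45],
})

# One mean per category: these are the bar heights.
mean_hours = toy.groupby("sex")["hours-per-week"].mean()
print(mean_hours)

# mean_hours.plot(kind="bar") would draw the bars (requires matplotlib).
```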

Data Preprocessing

Removing outliers and missing values

IQR:

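import numpy as np

def remove_outlier_IQR(df, field_name):
    # Note: `iqr` holds 1.5 * IQR, the margin beyond the quartiles.
    iqr = 1.5 * (np.percentile(df[field_name], 75) -
                 np.percentile(df[field_name], 25))
    # Drop rows above Q3 + 1.5 * IQR (modifies df in place).
    df.drop(df[df[field_name] > (
        iqr + np.percentile(df[field_name], 75))].index, inplace=True)
    # Drop rows below Q1 - 1.5 * IQR; Q1 is recomputed after the drop above.
    df.drop(df[df[field_name] < (np.percentile(
        df[field_name], 25) - iqr)].index, inplace=True)
    return df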

df2 = remove_outlier_IQR(df,'final-wt')
df_final = remove_outlier_IQR(df2, 'hours-per-week')
df_final.shape

(36312, 15)
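
Because `remove_outlier_IQR` drops rows with `inplace=True` and returns the same object, `df`, `df2`, and `df_final` all refer to one modified DataFrame. If the original frame should stay intact, a non-mutating variant is one option. The sketch below is an alternative, not the notebook's code: `filter_outliers_iqr` and its `k` parameter are hypothetical names, the toy values are made up, and it computes both quartiles once up front, so its result can differ slightly from the function above.

```python
import pandas as pd


def filter_outliers_iqr(df: pd.DataFrame, field_name: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df keeping rows inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[field_name].quantile([0.25, 0.75])
    margin = k * (q3 - q1)
    keep = df[field_name].between(q1 - margin, q3 + margin)
    return df.loc[keep].copy()


# Toy data (hypothetical values, not taken from the adult dataset).
toy = pd.DataFrame({"hours-per-week": [38, 40, 40, 40, 42, 45, 99]})
cleaned = filter_outliers_iqr(toy, "hours-per-week")

# The input frame is untouched; only the returned copy is filtered.
print(len(toy), len(cleaned))
```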

Boxplot after outliers removal

outliers_boxplot

Encoding categorical features

  • Categorical features are encoded as dummy (one-hot) variables.
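
As a minimal sketch of dummy encoding, assuming `pandas.get_dummies` is used (the column names `income_<=50K` and `income_>50K` in the next section are consistent with its naming, but the notebook's exact call is not shown here). The toy rows are made up for illustration.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mimic the adult dataset.
raw = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# Each listed categorical column becomes one indicator column per category,
# named "<column>_<category>"; numeric columns are left unchanged.
data = pd.get_dummies(raw, columns=["sex", "income"])
print(list(data.columns))
```

Note that a two-category column such as `income` produces two complementary indicator columns, which is why both are dropped from the feature matrix and only one is kept as the target.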

Data preparation for training and testing

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = data.drop(columns=['income_<=50K', 'income_>50K'])
y = data['income_<=50K']

scaler = StandardScaler()
scaled_df = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    scaled_df, y, test_size=0.3)
print("X train shape: {} and y train shape: {}".format(
    X_train.shape, y_train.shape))
print("X test shape: {} and y test shape: {}".format(
    X_test.shape, y_test.shape))

X train shape: (25418, 108) and y train shape: (25418,)
X test shape: (10894, 108) and y test shape: (10894,)
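
The code above fits the scaler on the full feature matrix before splitting, so test-set statistics influence the scaling. A common alternative is to split first and fit the scaler on the training part only. The sketch below illustrates that ordering on synthetic data. The `stratify` and `random_state` arguments and all variable names are illustrative choices, not settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# Fit the scaler on training data only, then reuse it for the test data.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```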

Model Training and Evaluation

Random Forest Classifier

rfc

Logistic Regression

lgr

K Nearest Neighbors

knn

Naive Bayes

naiv
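
The four model families listed above can be trained and compared with one loop. The sketch below does so on a synthetic dataset from `make_classification`, not on the adult dataset. The hyperparameters, the choice of `GaussianNB` as the Naive Bayes variant, and accuracy as the metric are all assumptions for illustration; the notebook's actual settings and results are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the prepared features.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0)

# One estimator per model family named in this section (default-ish settings).
models = {
    "Random Forest Classifier": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

For an imbalanced target such as income, `sklearn.metrics.classification_report` gives per-class precision and recall, which can be more informative than accuracy alone.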