Data Analysis of Adult dataset.
Histogram:
Box plots:
Barplot of categorical features:
Pairplot:
Barplot for numerical vs categorical features:
IQR:
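import numpy as np

def remove_outlier_IQR(df, field_name):
    # Whisker length: 1.5 times the interquartile range (Q3 - Q1).
    iqr = 1.5 * (np.percentile(df[field_name], 75) -
                 np.percentile(df[field_name], 25))
    # Drop rows above Q3 + 1.5 * IQR (modifies df in place).
    df.drop(df[df[field_name] > (
        iqr + np.percentile(df[field_name], 75))].index, inplace=True)
    # Drop rows below Q1 - 1.5 * IQR; Q1 is recomputed on the trimmed data.
    df.drop(df[df[field_name] < (np.percentile(
        df[field_name], 25) - iqr)].index, inplace=True)
    return df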
df2 = remove_outlier_IQR(df,'final-wt')
df_final = remove_outlier_IQR(df2, 'hours-per-week')
df_final.shape
(36312, 15)
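As a point of comparison, here is a minimal sketch of the same IQR rule written so that it does not modify its input. The helper names `iqr_bounds` and `filter_outliers_iqr`, the parameter `k`, and the toy values are hypothetical and are not part of the original notebook. Unlike the function above, this version computes both quartiles once, before any rows are removed, so its result on real data may differ slightly.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    spread = k * (q3 - q1)
    return q1 - spread, q3 + spread

def filter_outliers_iqr(df, field_name, k=1.5):
    """Return a new DataFrame keeping only rows inside the IQR fences."""
    lower, upper = iqr_bounds(df[field_name], k)
    # between() is inclusive, so values exactly on a fence are kept.
    mask = df[field_name].between(lower, upper)
    return df.loc[mask].copy()

# Toy data (illustrative values only, not taken from the Adult dataset).
toy = pd.DataFrame({"hours-per-week": [38, 40, 40, 40, 42, 45, 99]})
trimmed = filter_outliers_iqr(toy, "hours-per-week")
print(trimmed["hours-per-week"].tolist())
```

Because the function returns a copy, `toy` keeps all of its rows, which makes it easier to compare the data before and after trimming.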
Box plots after outlier removal:
Encoding categorical features using dummy variables:
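The encoding step itself is not shown here. The column names used later (`income_<=50K`, `income_>50K`) are consistent with pandas' `get_dummies`, so the following is a minimal sketch under that assumption. The toy frame and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the cleaned data; values are illustrative only.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Replace each listed categorical column with one indicator column per
# category, named "<column>_<category>". Numeric columns pass through.
data = pd.get_dummies(toy, columns=["workclass", "income"])
print(list(data.columns))
```

The two income indicator columns are complementary, so only one of them is needed as the target and both must be removed from the feature matrix.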
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X = data.drop(columns=['income_<=50K', 'income_>50K'])
y = data['income_<=50K']
scaler = StandardScaler()
scaled_df = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    scaled_df, y, test_size=0.3)
print("X train shape: {} and y train shape: {}".format(
    X_train.shape, y_train.shape))
print("X test shape: {} and y test shape: {}".format(
    X_test.shape, y_test.shape))
X train shape: (25418, 108) and y train shape: (25418,)
X test shape: (10894, 108) and y test shape: (10894,)
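The notebook fits the scaler on all rows before splitting. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the transformation. The random data, `random_state=0`, and `stratify=y` are assumptions added for the illustration and are not settings from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split first; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this ordering the training features have zero mean and unit variance by construction, while the test features are only approximately standardized.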