Hi. I have a question about the result (ACS CA->PR)
Thank you for conducting such excellent research. While reviewing it recently, I got strange results in one experiment. The code below runs the experiment on the Income dataset for 2018. In the in-sample experiment I get results similar to those in the paper, but in the out-of-sample (CA->PR) experiment the results are much worse than those reported. Is there anything wrong with my implementation?
PS. Even after tuning the parameters as suggested in the paper, the out-of-sample results below did not change significantly.
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score
from whyshift import get_data, fetch_model
import pandas as pd
import numpy as np
import random


def data_sampling(X, Y, ratio):
    # Stratified train/test split that keeps the class balance in both parts.
    dataset = pd.concat([pd.DataFrame(X), pd.DataFrame({"class": Y})], axis=1)
    split = StratifiedShuffleSplit(n_splits=1, test_size=ratio, random_state=1004)
    for train_idx, valid_idx in split.split(dataset, dataset["class"]):
        df_train = dataset.loc[train_idx].reset_index(drop=True)
        df_test = dataset.loc[valid_idx].reset_index(drop=True)
    return df_train, df_test


def model_result(in_data, in_data_test, out_data_test):
    # The last column holds the label; all other columns are features.
    f1 = []
    acc = []
    model = XGBClassifier(random_state=0, eval_metric='logloss').fit(
        in_data.iloc[:, :-1].to_numpy(), in_data.iloc[:, -1].to_numpy())
    pred_x = model.predict(in_data_test.iloc[:, :-1].to_numpy())  # in-sample predictions
    f1.append(f1_score(in_data_test.iloc[:, -1], pred_x, average="macro"))
    acc.append(accuracy_score(in_data_test.iloc[:, -1], pred_x))
    pred_x = model.predict(out_data_test.iloc[:, :-1].to_numpy())  # out-of-sample predictions
    f1.append(f1_score(out_data_test.iloc[:, -1], pred_x, average="macro"))
    acc.append(accuracy_score(out_data_test.iloc[:, -1], pred_x))
    result = pd.DataFrame({"setting": ["in_sample", "out_of_sample"], "f1": f1, "acc": acc})
    return result


# Load the 2018 Income data for California (source) and Puerto Rico (target).
X5, y5, feature_names = get_data("income", "CA", True, './datasets/acs/', 2018)
X7, y7, feature_names = get_data("income", "PR", False, './datasets/acs/', 2018)
# Drop feature columns 43 and 68 from both domains.
X5 = np.delete(X5, [43, 68], 1)
X7 = np.delete(X7, [43, 68], 1)
df_CA_train, df_CA_test = data_sampling(X5, y5, 0.2)
df_PR_train, df_PR_test = data_sampling(X7, y7, 0.2)
# Train on CA, then evaluate on the CA test split and the PR test split.
CA_PR = model_result(df_CA_train, df_CA_test, df_PR_test)
The results are as follows:
setting | f1 | acc |
---|---|---|
in_sample | 0.810529 | 0.816881 |
out_of_sample | 0.280381 | 0.284848 |
The third parameter of get_data controls whether to do data pre-processing. In this setting, I think it should be set to False in both calls.
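To illustrate why a mismatched flag can hurt this much, here is a minimal sketch on synthetic data. It does not use whyshift at all. It assumes, purely for illustration, that pre-processing amounts to standardizing a feature; what get_data actually does when the third argument is True may differ. Every name in it (make_domain, fit_threshold, accuracy) is hypothetical.

```python
import random
import statistics


def make_domain(n, seed):
    """Draw one raw feature and a noisy binary label that depends on it."""
    rng = random.Random(seed)
    raw = [rng.gauss(50.0, 10.0) for _ in range(n)]
    labels = [1 if x + rng.gauss(0.0, 5.0) > 50.0 else 0 for x in raw]
    return raw, labels


def fit_threshold(features, labels):
    """Toy classifier: cut halfway between the two class means."""
    pos = [x for x, y in zip(features, labels) if y == 1]
    neg = [x for x, y in zip(features, labels) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2.0


def accuracy(features, labels, threshold):
    """Fraction of samples where 'feature > threshold' matches the label."""
    preds = [1 if x > threshold else 0 for x in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Source domain: standardized before training (assumed stand-in for pre-processing).
src_raw, src_y = make_domain(2000, seed=0)
mu, sigma = statistics.mean(src_raw), statistics.stdev(src_raw)
src_scaled = [(x - mu) / sigma for x in src_raw]
threshold = fit_threshold(src_scaled, src_y)

# Target domain, evaluated two ways with the same trained threshold.
tgt_raw, tgt_y = make_domain(2000, seed=1)
tgt_scaled = [(x - mu) / sigma for x in tgt_raw]

acc_consistent = accuracy(tgt_scaled, tgt_y, threshold)  # same pre-processing as training
acc_mismatched = accuracy(tgt_raw, tgt_y, threshold)     # raw features fed to the same model

print(f"consistent: {acc_consistent:.3f}, mismatched: {acc_mismatched:.3f}")
```

The two domains here are drawn from the same distribution, so any gap between the two accuracies comes only from the inconsistent feature scaling, not from a real distribution shift. For the CA->PR experiment, that suggests re-running with the same flag value in both get_data calls before interpreting the out-of-sample numbers.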