namkoong-lab/whyshift

Hi. I have a question about the result (ACS CA->PR)

Closed this issue · 1 comment

Thank you for conducting such excellent research. While reviewing it recently, I encountered a strange result in one experiment. The code below runs the experiment on the 2018 ACS Income dataset. The in-sample results are similar to those in the paper, but the out-of-sample (CA->PR) results are much worse than reported. Is there anything wrong with my implementation?

PS. Even after tuning the hyperparameters as suggested in the paper, the out-of-sample results below did not change significantly (a sketch of the kind of tuning I mean appears after the results).

from xgboost import XGBClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit
from whyshift import get_data
import pandas as pd
import numpy as np

def data_sampling(X, Y, ratio):
    # Combine features and labels, then carve off a stratified hold-out split.
    dataset = pd.concat([pd.DataFrame(X), pd.DataFrame({"class": Y})], axis=1)
    split = StratifiedShuffleSplit(n_splits=1, test_size=ratio, random_state=1004)
    for train_idx, valid_idx in split.split(dataset, dataset["class"]):
        df_train = dataset.loc[train_idx].reset_index(drop=True)
        df_test = dataset.loc[valid_idx].reset_index(drop=True)
    return df_train, df_test


def model_result(in_data, in_data_test, out_data_test):
    f1 = []
    acc = []

    # Train on the in-distribution (CA) training split; labels are the last column.
    model = XGBClassifier(random_state=0, eval_metric='logloss').fit(
        in_data.iloc[:, :-1].to_numpy(), in_data.iloc[:, -1].to_numpy()
    )

    # In-sample prediction (CA test split)
    pred_x = model.predict(in_data_test.iloc[:, :-1].to_numpy())
    f1.append(f1_score(in_data_test.iloc[:, -1], pred_x, average="macro"))
    acc.append(accuracy_score(in_data_test.iloc[:, -1], pred_x))

    # Out-of-sample prediction (PR test split)
    pred_x = model.predict(out_data_test.iloc[:, :-1].to_numpy())
    f1.append(f1_score(out_data_test.iloc[:, -1], pred_x, average="macro"))
    acc.append(accuracy_score(out_data_test.iloc[:, -1], pred_x))

    result = pd.DataFrame({"setting": ["in_sample", "out_of_sample"], "f1": f1, "acc": acc})
    return result

# Load the 2018 ACS Income data for CA (source) and PR (target).
X5, y5, feature_names = get_data("income", "CA", True, './datasets/acs/', 2018)
X7, y7, feature_names = get_data("income", "PR", False, './datasets/acs/', 2018)
# Drop columns 43 and 68 from both datasets so the feature spaces match.
X5 = np.delete(X5, [43, 68], 1)
X7 = np.delete(X7, [43, 68], 1)

df_CA_train, df_CA_test = data_sampling(X5, y5, 0.2)
df_PR_train, df_PR_test = data_sampling(X7, y7, 0.2)

CA_PR = model_result(df_CA_train, df_CA_test, df_PR_test)
print(CA_PR)

Here is the result:

setting         f1        acc
in_sample       0.810529  0.816881
out_of_sample   0.280381  0.284848
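
For reference, the tuning I mention above was along these lines. This is only a minimal sketch; the grid values here are placeholders of my own, not the paper's exact search space:

from sklearn.model_selection import GridSearchCV

# Illustrative grid; the paper's actual search space may differ.
param_grid = {
    "max_depth": [3, 6, 9],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 300],
}
search = GridSearchCV(
    XGBClassifier(random_state=0, eval_metric="logloss"),
    param_grid,
    scoring="f1_macro",
    cv=3,
)
search.fit(df_CA_train.iloc[:, :-1].to_numpy(), df_CA_train.iloc[:, -1].to_numpy())
print(search.best_params_)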

The third parameter of get_data controls whether data pre-processing is applied. In this setting, I think both calls should set it to False.
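
Concretely, assuming the same positional arguments as in the snippet above, the corrected calls would be:

# Disable pre-processing for both states so the CA-trained model and the
# PR test set share the same raw feature encoding.
X5, y5, feature_names = get_data("income", "CA", False, './datasets/acs/', 2018)
X7, y7, feature_names = get_data("income", "PR", False, './datasets/acs/', 2018)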