ivam-he/PSHGCN

Unfair evaluation settings on IMDB lead to unreasonable results (74+)

Closed this issue · 1 comments

First of all, thank you for your work.

I found that the evaluation for the multi-label dataset IMDB is unreasonable, which leads your method to an incredibly high F1 score (74+). Specifically, when constructing binary_pred you use prior knowledge of how many labels each node has, which should not be available at evaluation time. That is unfair.

    for i in range(preds.shape[0]):
        # k is read from the ground-truth labels -- information the
        # model should not have access to during evaluation
        k = labels[i].sum().astype('int')
        # keep exactly the k highest-scoring classes as positives
        topk_idx = preds[i].argsort()[-k:]
        binary_pred[i][topk_idx] = 1
        for pos in list(labels[i].nonzero()[0]):
            if labels[i][pos] and labels[i][pos] == binary_pred[i][pos]:
                num_correct += 1

In fact, the usual practice is to evaluate with metrics.f1_score(labels, preds > 0).
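To make the gap concrete, here is a minimal sketch on synthetic data (not the paper's actual pipeline; the array shapes and the noise model are my own assumptions) comparing the standard thresholded micro-F1 against the top-k protocol that peeks at the true label count:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n, c = 1000, 5
# synthetic multi-hot ground truth and mildly informative logits
labels = (rng.random((n, c)) < 0.3).astype(int)
preds = labels + rng.normal(size=(n, c))

# Fair protocol: threshold logits at 0, no access to ground truth.
fair = f1_score(labels, preds > 0, average="micro")

# Leaky protocol: use the true label count k per node to pick the top-k logits.
binary_pred = np.zeros_like(labels)
for i in range(n):
    k = int(labels[i].sum())
    if k > 0:
        binary_pred[i][preds[i].argsort()[-k:]] = 1
leaky = f1_score(labels, binary_pred, average="micro")

print(f"fair micro-F1 = {fair:.3f}, leaky top-k micro-F1 = {leaky:.3f}")
```

With the same logits, the leaky protocol scores noticeably higher, because knowing k per node guarantees the predicted positive count always matches the ground truth.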

Don't you think this is unfair to other published and existing papers? Everyone is competing on the same benchmarks; you cannot inflate your scores by changing the evaluation settings for your own method.

Thank you for your attention.
Regarding our evaluation approach for the multi-label dataset IMDB, we followed the paper "Descent Steps of a Relation-Aware Energy Produce Heterogeneous Graph Neural Networks" presented at NeurIPS 2022 (their code can be found at HALO). After rechecking both their code and ours, I have confirmed the issue you mentioned. Our paper is currently a preprint; we intend to fix this problem in the next version and release the corresponding code as soon as possible.