openai/moderation-api-release

Can't reproduce the results reported in the paper with a 'similar' dataset

miniweeds opened this issue · 2 comments

I tried to test the moderation API's performance with the Jigsaw dataset from Kaggle. The performance is much worse than what was reported in the paper. Why? What am I missing? Here are the parameters for my test (a minimal sketch of this kind of evaluation is shown after the results below):

My test results:

  • accuracy: 0.92
  • f1: 0.23
  • AUPRC: 0.33
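For reference, a minimal sketch of this kind of evaluation (not my exact script; the file name, the choice of the "toxic" column as the ground-truth label, and the max-category-score aggregation are illustrative, and it assumes the pre-1.0 openai Python client):

```python
# Minimal sketch of a Jigsaw-vs-moderation-API evaluation (simplified).
# Assumes the pre-1.0 openai Python client; the file name, the "toxic" label
# column, and the max-category-score aggregation are illustrative choices.
import openai
import pandas as pd
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

openai.api_key = "YOUR_API_KEY"  # placeholder

df = pd.read_csv("jigsaw_test.csv")  # hypothetical file name

preds, scores = [], []
for text in df["comment_text"]:
    result = openai.Moderation.create(input=text)["results"][0]
    preds.append(int(result["flagged"]))                    # hard label
    scores.append(max(result["category_scores"].values()))  # soft score

y_true = df["toxic"].astype(int)
print("accuracy:", accuracy_score(y_true, preds))
print("f1:      ", f1_score(y_true, preds))
print("AUPRC:   ", average_precision_score(y_true, scores))
```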

The paper (https://arxiv.org/pdf/2208.03274.pdf) reports much better AUPRC on Jigsaw:

  • Identity-hate: .6890
  • Insult: .8548
  • Obscene: .8353*
  • Threat: .6144*
  • Toxic: .9304*

Or is it because the Jigsaw dataset used in the paper is different from the one I used? I tried the non-English Jigsaw dataset too, and the performance was worse as well.

I see the problem now. The moderation API only classifies the following categories: "hate", "hate/threatening", "self-harm", "sexual", "sexual/minors", "violence", "violence/graphic". The Jigsaw dataset I used covers many more categories. That explains why the API got such a low AUPRC: the model and the test set don't align. In that case this API is not suitable for Jigsaw-type problems.

Here are the categories in Jigsaw train dataset: severe_toxicity,obscene,identity_attack,insult,threat,asian,atheist,bisexual,black,buddhist,christian,female,heterosexual,hindu,homosexual_gay_or_lesbian,intellectual_or_learning_disability,jewish,latino,male,muslim,other_disability,other_gender,other_race_or_ethnicity,other_religion,other_sexual_orientation,physical_disability,psychiatric_or_mental_illness,transgender,white.
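If anyone still wants to compare against the paper's per-category Jigsaw numbers, the labels would have to be aligned with the API categories first. A rough sketch of what that could look like (the mapping is my own guess, not anything documented by the paper or the API):

```python
# Rough sketch of aligning Jigsaw labels with the moderation API categories.
# The mapping is my own guess; "insult" and "obscene" have no obvious
# counterpart among the API's categories.
JIGSAW_TO_MODERATION = {
    "identity_attack": ["hate"],
    "threat": ["violence", "hate/threatening"],
    "severe_toxicity": ["hate", "violence"],
}

def mapped_score(category_scores, jigsaw_label):
    """Score one comment for one Jigsaw label using the mapped API categories."""
    categories = JIGSAW_TO_MODERATION.get(jigsaw_label, [])
    return max((category_scores[c] for c in categories), default=0.0)

# Per-label AUPRC would then be computed label by label, e.g. (all_scores
# being a hypothetical list of category_scores dicts collected earlier):
# average_precision_score(df["threat"] >= 0.5,
#                         [mapped_score(s, "threat") for s in all_scores])
```

Even with a mapping like this, labels such as insult and obscene have no real counterpart in the API, so low AUPRC on those is expected.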