tomekkorbak/pretraining-with-human-feedback

Why is the toxicity threshold so low? It is set to 0.00056.

Opened this issue · 0 comments

In configs/toxicity/conditional.yml, we have the following config:

```yaml
dataset:
  conditional_training_config:
    threshold: 0.00056
    aligned_prefix: "<|aligned|>"
    misaligned_prefix: "<|misaligned|>"
    drop_token_fraction: 0.01
```

Why is the toxicity threshold here 0.00056? This is incredibly low. Only sentences with toxicity scores lower than 0.00056 would be marked as non-toxic; everything greater than (or equal to) that would be marked as toxic.
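To make my reading of the config concrete, here is a minimal sketch of how I understand the threshold to be applied when tagging training segments. The function name and structure are my own illustration, not the repo's actual code:

```python
# Values from configs/toxicity/conditional.yml
THRESHOLD = 0.00056
ALIGNED_PREFIX = "<|aligned|>"
MISALIGNED_PREFIX = "<|misaligned|>"

def tag_segment(text: str, toxicity_score: float) -> str:
    """Prepend a control token based on the segment's toxicity score
    (hypothetical illustration of my understanding)."""
    if toxicity_score < THRESHOLD:
        return ALIGNED_PREFIX + text
    return MISALIGNED_PREFIX + text

# With such a low threshold, even a mildly-scored segment gets the
# misaligned prefix:
tag_segment("some text", 0.001)   # -> "<|misaligned|>some text"
tag_segment("some text", 0.0001)  # -> "<|aligned|>some text"
```

If this is the right mental model, then nearly all text with any nonzero toxicity signal would land in the misaligned bucket.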

Don't we only want documents to be marked as toxic when their toxicity is, say, 0.9 or greater? (I chose 0.9 arbitrarily as an example.) Generally speaking, 0.00056 seems like quite a low threshold, and I'm worried that it might hurt performance.

Can you explain the thought process behind setting the toxicity threshold to 0.00056? Is this simply what got the best results?

Thanks!