r-three/t-few

Validation score on WSC decreases with training


Thank you for the amazing work on t-few! I've noticed strange behavior when running SuperGLUE's WSC. I've been logging the validation score every 40 epochs by setting self.eval_epoch_interval = 40, and when I run the command

python -m src.pl_train -c ia3.json+wsc.json -k save_model=False exp_name=first_exp

the output is as follows:

{"accuracy": 0.6730769230769231, "score_gt": 0.5068197436630726, "score_cand": 0.7191649047801127}
{"accuracy": 0.49038461538461536, "score_gt": 1.4563168384707892, "score_cand": 1.505529030584372}
{"accuracy": 0.47115384615384615, "score_gt": 3.4743554890155792, "score_cand": 2.727144861450562}
{"accuracy": 0.46153846153846156, "score_gt": 4.202766236777489, "score_cand": 3.5702959763316007}
{"accuracy": 0.40384615384615385, "score_gt": 5.157541000499175, "score_cand": 3.5657502871293287}
{"accuracy": 0.3942307692307692, "score_gt": 5.397989429533482, "score_cand": 3.975659689651086}
{"accuracy": 0.40384615384615385, "score_gt": 5.073869264469697, "score_cand": 3.995581218542961}

The last accuracy score is reported at epoch 240 out of a total of 250 epochs.
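For reference, the eval interval could presumably also be set from the command line instead of editing the source, assuming eval_epoch_interval is a recognized -k override key in the config (I haven't verified this):

python -m src.pl_train -c ia3.json+wsc.json -k save_model=False eval_epoch_interval=40 exp_name=first_exp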

Any ideas on what is going on here? Thanks!

I can try running this experiment in the latter half of this week. In the meantime, I remember WSC being a tricky dataset that often produces unstable results. Would you mind running it with a few other seeds and seeing whether this behavior persists?
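For example, a minimal sweep along these lines (assuming seed is a valid -k override in the config, which you'd want to double-check):

for seed in 0 1 32 42; do
    # hypothetical seed override; exp_name chosen here just to keep runs separate
    python -m src.pl_train -c ia3.json+wsc.json -k save_model=False seed=${seed} exp_name=wsc_seed_${seed}
done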

And btw, is this just WSC? Do other datasets have this problem?

Hi Haokun, thank you for the response. Indeed, after changing the seed, the results are more in line with what I expected. I've been seeing similar problems with WiC, but again they appear to be caused by seed variability. RTE seems more stable.