kawine/dataset_difficulty

Why use the same dataset for the train and validation data?

cfsmile opened this issue · 2 comments

Dear Kawin,

In line 27 of finetune.sh, why is "validation_file" set to snli_train_std.csv instead of SNLI's validation or test split?
In lines 216, 218, 321, and 322 of run_glue_no_trainer.py, it seems that both the train data and the eval data are snli_train_std.csv. Is that intended, or are there specific considerations behind this setting?

I have the same question about line 39 of finetune.sh, which trains the model on the null-input data.

Would you be kind enough to elaborate on this for me?

Thanks in advance.

Fred

kawine commented

In the Appendix of the paper, we have a figure showing that it's possible to over-estimate the V-info if you overfit to the training data and then estimate the V-info using that same training data. Similarly, you'd under-estimate the V-info if you applied the overfitted model to some held-out data.

These filenames are likely a holdover from when that figure was generated.

For accurate estimates, you want to neither overfit nor underfit the model and then estimate the V-info on some held-out data (as done in the other figures in the paper).
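To make the held-out estimation concrete, here is a minimal sketch of how the V-info estimate is assembled from per-example log-probabilities. It assumes you already have, for each held-out example, the gold-label log-probability from the null-input model and from the model that sees the real input; the function name and the toy numbers are illustrative, not from the repo.

```python
import numpy as np

def v_info_estimate(logp_null, logp_cond):
    """Estimate V-information as the mean pointwise V-information (PVI)
    over a HELD-OUT split: PVI(x -> y) = -log g'(y|null) + log g(y|x),
    where g' was finetuned on null inputs and g on the real inputs.
    Both arrays hold log-probabilities of the gold label per example."""
    pvi = logp_cond - logp_null  # elementwise PVI
    return pvi.mean()

# Toy held-out log-probabilities (hypothetical numbers):
logp_null = np.log(np.array([0.34, 0.33, 0.30, 0.35]))  # ~uniform 3-class prior
logp_cond = np.log(np.array([0.90, 0.70, 0.55, 0.95]))  # model that sees the input
print(v_info_estimate(logp_null, logp_cond))  # positive => the input is informative
```

The key point from the answer above is simply which split these arrays come from: if both models were evaluated on the data they were finetuned on, an overfit conditional model inflates `logp_cond` and the estimate is biased upward.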

Thanks!