kawine/dataset_difficulty

can we use V-information to rank the difficulty of training instances?

Closed this issue · 1 comment

Hi, thanks for creating this repo. The code runs smoothly for me. I have a question about the application of this method: can we use it to score and rank training instances instead of test instances, as is currently done in the code, and if so, what should we modify? TIA!

kawine commented

Yes, for sure. The caveat is that if you overfit to the data during training, you can greatly over-estimate the usable information between the random variables X and Y. You can avoid this by using a validation set to track overfitting. Typically the model only starts overfitting after about 2 epochs (see Appendix B of the paper for a graph of this).
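For the validation tracking, something as simple as recording the per-epoch validation loss and stopping once it stops improving is enough. A minimal sketch (this helper is illustrative, not part of this repo; the name and `patience` parameter are my own):

```python
def first_overfit_epoch(val_losses, patience=1):
    """Return the last epoch before overfitting began, i.e. the epoch
    whose validation loss was followed by `patience` consecutive
    non-improving epochs. Returns None if the loss never stops improving.
    Illustrative helper, not part of dataset_difficulty."""
    best = float("inf")
    bad_streak = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_streak = 0
        else:
            bad_streak += 1
            if bad_streak >= patience:
                return epoch - patience
    return None

# Loss improves for 2 epochs, then rises: stop training after epoch 2.
print(first_overfit_epoch([0.62, 0.48, 0.51, 0.57]))  # -> 2
```

You would then use the checkpoint from that epoch when computing v-info, so the estimate reflects generalizable rather than memorized information.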

For example, if you want to calculate the v-info of the SNLI training data, then in step 4 of the README you would just pass the filenames of the training data instead of the test data:

v_info(
  f"./data/snli_train_std.csv",
  f"{MODEL_DIR}/finetuned/bert-base-cased_snli_std2",
  f"./data/snli_train_null.csv", 
  f"{MODEL_DIR}/finetuned/bert-base-cased_snli_null",
  'bert-base-cased',
  out_fn=f"PVI/bert-base-cased_std2_train.csv"
)
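Once the per-instance PVI scores are in the output CSV, ranking training instances by difficulty is just a sort: lower PVI means less usable information, i.e. a harder instance. A sketch, assuming the output file has a PVI column (the column names here are illustrative stand-ins, so check them against your actual output file):

```python
import pandas as pd

# Stand-in for the v_info output; in practice you would do something like:
#   df = pd.read_csv("PVI/bert-base-cased_std2_train.csv")
df = pd.DataFrame({
    "sentence": ["ex_a", "ex_b", "ex_c"],
    "PVI": [1.3, -0.2, 0.7],  # lower PVI = harder instance
})

# Hardest training instances first
ranked = df.sort_values("PVI", ascending=True)
print(ranked["sentence"].tolist())  # -> ['ex_b', 'ex_c', 'ex_a']
```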