kawine/dataset_difficulty

can we use V-information to rank the difficulty of training instances?

Closed this issue · 1 comment

Hi, thanks for creating this repo. The code runs smoothly for me. I have a question about the application of this method: can we use it to score and rank training instances instead of test instances, as is currently done in the code, and if so, what should we modify? TIA!

kawine commented

Yes, for sure. The caveat is that if you overfit to the data during training, you can greatly over-estimate the usable information between the random variables X and Y. You can avoid this by using a validation set to track overfitting. Typically the model only starts overfitting after about 2 epochs (see Appendix B of the paper for a graph of this).
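For the validation tracking, something as simple as recording the per-epoch validation loss and stopping once it stops improving is enough. A minimal sketch (this helper is illustrative, not part of this repo; the name and `patience` parameter are my own):

```python
def first_overfit_epoch(val_losses, patience=1):
    """Return the last epoch before overfitting began, i.e. the epoch
    whose validation loss was followed by `patience` consecutive
    non-improving epochs. Returns None if the loss never stops improving.
    Illustrative helper, not part of dataset_difficulty."""
    best = float("inf")
    bad_streak = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_streak = 0
        else:
            bad_streak += 1
            if bad_streak >= patience:
                return epoch - patience
    return None

# Loss improves for 2 epochs, then rises: stop training after epoch 2.
print(first_overfit_epoch([0.62, 0.48, 0.51, 0.57]))  # -> 2
```

You would then use the checkpoint from that epoch when computing v-info, so the estimate reflects generalizable rather than memorized information.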

For example, if you want to calculate the v-info of the SNLI training data, then in step 4 of the README you would just pass the filenames of the training data instead of the test data:

v_info(
  f"./data/snli_train_std.csv",
  f"{MODEL_DIR}/finetuned/bert-base-cased_snli_std2",
  f"./data/snli_train_null.csv", 
  f"{MODEL_DIR}/finetuned/bert-base-cased_snli_null",
  'bert-base-cased',
  out_fn=f"PVI/bert-base-cased_std2_train.csv"
)
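Once the per-instance PVI scores are in the output CSV, ranking training instances by difficulty is just a sort: lower PVI means less usable information, i.e. a harder instance. A sketch, assuming the output file has a PVI column (the column names here are illustrative stand-ins, so check them against your actual output file):

```python
import pandas as pd

# Stand-in for the v_info output; in practice you would do something like:
#   df = pd.read_csv("PVI/bert-base-cased_std2_train.csv")
df = pd.DataFrame({
    "sentence": ["ex_a", "ex_b", "ex_c"],
    "PVI": [1.3, -0.2, 0.7],  # lower PVI = harder instance
})

# Hardest training instances first
ranked = df.sort_values("PVI", ascending=True)
print(ranked["sentence"].tolist())  # -> ['ex_b', 'ex_c', 'ex_a']
```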