Setting k for pre-processing train data

Question

thomaspzollo opened this issue 2 years ago · 4 comments

How can we change the number of training samples produced by the preprocessing script? It seems many of the files (e.g. ade_effect.py, anli.py, etc.) have k hardcoded and thus do not produce the number of examples passed in the arguments.

Thank you!

Answer 1 · 2022-10-29T02:23:09.000Z

Hi! You can easily control the number of examples (k). Please see the instructions here. You just need to run either _build_gym.py or unifiedqa.py to preprocess all datasets at once, and use --test_k or --train_k to control k.

Answer 2 · 2022-10-29T10:21:41.000Z

Yes, I did that, but doesn't _build_gym.py call e.g. agnews.py, which runs this:
train, dev, test = dataset.generate_k_shot_data(k=16, seed=seed, path="../data/")
and the other files I've inspected do the same, so my train, dev, and test sets each have 16 items no matter what I pass as the parameters.

Answer 3 · 2022-10-29T22:26:13.000Z

It actually reads the flags here, and uses train_k and test_k here instead of k=16. Please refer to these lines if you want to double-check.

You may not get the expected results if you use a different command. Please try the command line in the README, and if it still uses k=16, let me know.
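
For anyone reading later, here is a minimal sketch of the general pattern described above: the value of k comes from command-line flags and overrides a default of 16. This is not the repository's actual code. The flag names --train_k and --test_k are taken from this thread, while sample_k_shot, parse_args, build_splits, the --seed flag, and the default values are hypothetical names and choices made up for illustration.

```python
import argparse
import random


def sample_k_shot(examples, k, seed):
    """Hypothetical helper: draw up to k examples reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(examples, min(k, len(examples)))


def parse_args(argv=None):
    """Parse the k-related flags; the defaults here are assumptions."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_k", type=int, default=16)
    parser.add_argument("--test_k", type=int, default=16)
    parser.add_argument("--seed", type=int, default=100)
    return parser.parse_args(argv)


def build_splits(pool, args):
    """Use the flag values instead of a hardcoded k when sampling each split."""
    train = sample_k_shot(pool, args.train_k, args.seed)
    test = sample_k_shot(pool, args.test_k, args.seed + 1)
    return train, test


if __name__ == "__main__":
    args = parse_args()
    pool = [f"example-{i}" for i in range(1000)]  # stand-in for a real dataset
    train, test = build_splits(pool, args)
    print(len(train), len(test))
```

If a per-dataset script like this is run directly with no flags, it falls back to the default of 16, which would match the behaviour described in the question. Passing --train_k and --test_k through the entry point is what changes the split sizes.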

Answer 4 · 2022-10-31T15:36:42.000Z

I got it now, thank you for the help!