facebookresearch/MetaICL

Setting k for pre-processing train data

thomaspzollo opened this issue · 4 comments

How can we change the number of training samples produced by the preprocessing script? Many of the files (e.g. ade_effect.py, anli.py) seem to have k hardcoded, so they do not produce the number of examples passed in the arguments.

Thank you!

Hi! You can easily control the number of examples (k). Please see the instructions here. You just need to run either _build_gym.py or unifiedqa.py to preprocess all datasets at once, and use --test_k or --train_k to set k.

Yes, I did that, but doesn't _build_gym.py call e.g. agnews.py, which runs this:
train, dev, test = dataset.generate_k_shot_data(k=16, seed=seed, path="../data/")
The other files I've inspected do the same, so my train, dev, and test sets each have 16 items no matter what I pass as parameters.

It actually reads the flags here, and uses train_k and test_k here instead of k=16. Please refer to these lines if you want to double-check.
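For readers following along, here is a minimal sketch of the general pattern being described: a value hardcoded at the call site (k=16) can be superseded by command-line flags read inside the method. This is not the repository's code. ToyDataset, its constructor, and the sampling logic are hypothetical, made up for illustration; only the flag names --train_k and --test_k and the k=16 call come from this thread.

```python
import argparse
import random


def parse_args(argv=None):
    # Flag names are taken from the thread; the defaults here are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_k", type=int, default=16)
    parser.add_argument("--test_k", type=int, default=16)
    return parser.parse_args(argv)


class ToyDataset:
    """Hypothetical stand-in for a per-task dataset class."""

    def __init__(self, examples, split, args):
        self.examples = examples
        self.split = split  # "train" or "test"
        self.args = args

    def generate_k_shot_data(self, k, seed):
        # The k passed by the caller is ignored in this sketch:
        # the parsed flags decide how many examples are sampled.
        if self.split == "train":
            effective_k = self.args.train_k
        else:
            effective_k = self.args.test_k
        rng = random.Random(seed)
        return rng.sample(self.examples, min(effective_k, len(self.examples)))


# Usage: the call site still says k=16, but the flag value wins.
args = parse_args(["--train_k", "4", "--test_k", "2"])
dataset = ToyDataset(list(range(100)), split="train", args=args)
print(len(dataset.generate_k_shot_data(k=16, seed=0)))  # prints 4
```

In a design like this, seeing k=16 in a per-task file does not by itself tell you how many examples are written out, which is why checking the linked lines matters.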

You may not get the expected results if you use a different command. Please try the command line in the README, and if it still uses k=16, let me know.

I got it now, thank you for the help!