facebookresearch/MetaICL

Questions around the data preprocessing

JunShern opened this issue · 5 comments

Hi again! I would like to run the training procedure with my own custom datasets, but I'm finding the data setup quite confusing.

In particular, I'm trying to understand the preprocessing done to generate the files in the MetaICL/data directory. Since I am not using HuggingFace datasets, I think the easiest route for me is to adapt unifiedqa.py to take my own input and output the right format.

However, looking at the files that have been generated in my MetaICL/data directory, I see a lot of files and I do not understand how they are used:

$ tree MetaICL/data
data
├── ade_corpus_v2-classification
│   ├── ade_corpus_v2-classification_16_100_dev.jsonl
│   ├── ade_corpus_v2-classification_16_100_test.jsonl
│   ├── ade_corpus_v2-classification_16_100_train.jsonl
│   ├── ade_corpus_v2-classification_16_13_dev.jsonl
│   ├── ade_corpus_v2-classification_16_13_test.jsonl
│   ├── ade_corpus_v2-classification_16_13_train.jsonl
│   ├── ade_corpus_v2-classification_16_21_dev.jsonl
│   ├── ade_corpus_v2-classification_16_21_test.jsonl
│   ├── ade_corpus_v2-classification_16_21_train.jsonl
│   ├── ade_corpus_v2-classification_16384_100_dev.jsonl
│   ├── ade_corpus_v2-classification_16384_100_train.jsonl
│   ├── ade_corpus_v2-classification_16_42_dev.jsonl
│   ├── ade_corpus_v2-classification_16_42_test.jsonl
│   ├── ade_corpus_v2-classification_16_42_train.jsonl
│   ├── ade_corpus_v2-classification_16_87_dev.jsonl
│   ├── ade_corpus_v2-classification_16_87_test.jsonl
│   └── ade_corpus_v2-classification_16_87_train.jsonl
├── ade_corpus_v2-dosage
│   ├── ade_corpus_v2-dosage_16_100_dev.jsonl
...

I understand that the files are named {task}_{k}_{seed}_{split}.jsonl, but I am confused about how these files are used, and which of them are used during training versus testing.

My main questions are:

  1. Can you please explain how each of those files is used during training and testing?

  2. Why do you generate so many files, instead of simply 3 files for train, dev and test?

In case it's not covered in the general explanation, I also have some additional questions from looking through the code:

  1. With the default setup, it seems like only *_16384_100_train.jsonl is used during training. So if I want to train on a custom dataset, I can just put my file in data/my_task/my_task_16384_100_train.jsonl without any of the other files, and that should be enough to run the train procedure?

  2. The *_16_{seed}_train.jsonl files are always 16 lines long, whereas *_16_{seed}_test.jsonl files are always much longer. Why?

  3. As far as I can tell, the *_dev.jsonl files are never used?

  4. Is there something special about seed=100? Looking at this and this.

Thank you very much in advance!

Thanks for your questions.

Basically, we generate k-shot train and dev datasets, each of which includes k random examples from the train data, and test datasets that are exactly the same as the original test data. As you can see, we have 5 train/dev/test files, each generated with a different seed. This is because we want to experiment with different sets of training data, so that we can alleviate the problem of high variance in few-shot learning. (Note that all test files are actually identical. We duplicated them just for convenience.)
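
For concreteness, the subsampling works roughly like the following sketch (this is not the actual preprocessing script; train_examples, test_examples, the task name, and the 2k train/dev split are assumptions, so check the repo's preprocessing scripts for the exact details):

import json
import os
import random

def write_jsonl(path, examples):
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

# train_examples / test_examples: lists of dicts for the original dataset,
# loaded however you like; `task` is your (hypothetical) task name
task = "my_task"
k = 16
os.makedirs(f"data/{task}", exist_ok=True)
for seed in [100, 13, 21, 42, 87]:
    random.seed(seed)
    sampled = random.sample(train_examples, 2 * k)  # k for train, k for dev
    write_jsonl(f"data/{task}/{task}_{k}_{seed}_train.jsonl", sampled[:k])
    write_jsonl(f"data/{task}/{task}_{k}_{seed}_dev.jsonl", sampled[k:])
    # the test file is just a copy of the original test data, identical across seeds
    write_jsonl(f"data/{task}/{task}_{k}_{seed}_test.jsonl", test_examples)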

To answer your specific questions as well:

  1. *_16384_100_train.jsonl is used for meta-training. If you want to meta-train the model on your data and won't use the data for evaluation, then you are right that you can provide just that file without any of the others (see the sketch of the file contents after this list). If you want to use your data for few-shot evaluation, you still need the other files.
  2. The *_16_{seed}_train.jsonl files always include 16 training examples, since they are for 16-shot learning.
  3. Yes, the dev sets are not used.
  4. This is only because for meta-training, we use seed=100 and do not use the other seeds. As I said above, using multiple seeds is only because there is high variance across different choices of k training examples when k is small (e.g. 16). But for meta-training, we assume a large pool of training examples is available (16384, specifically). Thus, we don't really need to experiment with multiple seeds.
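
Regarding the file contents: each line in these jsonl files is a JSON object for a single example. Here is a sketch of what one line of your custom my_task_16384_100_train.jsonl could look like (the field names follow the format of the generated data files; options is only needed for classification tasks, and the example text is made up):

{"task": "my_task", "input": "Review: The movie was great. What is the sentiment?", "output": "positive", "options": ["positive", "negative"]}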

Let me know if you need any clarification -- it might be easier to answer if you could clarify whether you will use your data only for meta-training (recommended when you have a large pool of examples), or whether you want to use your data for few-shot evaluation (when your data is very small, e.g. 16 examples).

Thank you for the explanation! I understand much better now. Just to recap my overall understanding in pseudocode:


Running evaluation (test.py) does:

import random

for task in eval_tasks:
  for seed in [100, 13, 21, 42, 87]:

    # Randomly sample a 16-shot context
    random.seed(seed)
    k_shot_context = random.sample(task['train'], 16) # list of 16 (x, y) pairs
    
    # Evaluate with the context on all test examples
    for x, y in task['test']:
      prompt = str(k_shot_context) + str(x)
      y_pred = model.generate(prompt)
      score = calc_score(y, y_pred)

where in your implementation,

  • k_shot_context is cached to file as *_16_{seed}_train.jsonl
  • task['test'] is cached to file as *_16_{seed}_test.jsonl
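
In other words, at evaluation time the cached files can be read back with something like this (a minimal sketch; load_jsonl and the exact paths are just illustrative):

import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

k_shot_context = load_jsonl("data/my_task/my_task_16_100_train.jsonl") # the 16 demonstrations
test_examples = load_jsonl("data/my_task/my_task_16_100_test.jsonl")   # full original test set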

Running training (train.py) does:

k = 16
for step in range(n_train_steps):

  # Randomly sample a task to train on
  task = random.choice(train_tasks)

  # Randomly sample a 16-shot context + 1 query example
  train_set = random.sample(task['train'], k + 1)
  k_shot_context, query_pair = train_set[:-1], train_set[-1]

  # Train the model to predict the query example given the context
  prompt = str(k_shot_context) + query_pair.x
  y_pred = model.generate(prompt)
  loss = loss_fn(query_pair.y, y_pred)
  loss.backward()

where in your implementation,

  • task['train'] is cached to file as *_16384_100_train.jsonl

Is that right?

Oh and my use case is to do meta-training using custom data, and evaluate on the existing tasks.

I'm still confused about one line, though. As far as I can tell, this line:

                                 "{}_{}_{}_{}.jsonl".format(dataset, k, seed if split=="train" else 100,
                                                          "test" if split is None else split))

could be simplified to "{}_{}_{}_{}.jsonl".format(dataset, k, seed, split).

As far as I can tell, the only time those conditionals make a difference is to force {seed}_test.jsonl to 100_test.jsonl. But this seems unnecessary, since as you said, the _test.jsonl files are always the same regardless of seed. It could just be a mistake, but I want to ask in case I am misunderstanding something.

Everything you said is 100% correct. Thanks for writing this wonderful pseudo-code that perfectly explains how things work at a high level. And you are again right about the line in utils/data.py --- it can be simplified to "{}_{}_{}_{}.jsonl".format(dataset, k, seed, split), because the test data is always the same no matter what the seed is.

Perfect, thank you so much @shmsw25. I have the understanding I needed, I will close this issue now.