LechengKong/OneForAll

How can I test other few-shot settings of OFA?

karin0018 opened this issue · 7 comments

Hi! First of all, thanks for your amazing work!
In Section E.2 you provide experimental results spanning more ways and shots on the ogbn-arxiv and FB15K237 datasets, such as 3/5-way on ogbn-arxiv and 10/20-way on FB15K237. I want to test other numbers of ways on these datasets, like 10/20/30/40-way. How should I do it?

Can I just modify config/data_config.yaml and config/task_config.yaml like this:

config/data_config.yaml

FB15K237_fs_403:
  <<: *FB15K237_fs
  args:
    walk_length: null
    single_prompt_edge: True
    n_way: 40
    k_shot: 3
    base_construct: ConstructKG
    no_class_node: True
    remove_edge: True
  num_classes: 40

config/task_config.yaml

FB15K237_fs: &FB15K237_fs
  <<: *LR-link
  dataset: FB15K237_fs
  eval_set_constructs:
    - stage: train
      split_name: train
      dataset: FB15K237_fs

...
    - stage: valid
      split_name: valid
      dataset: FB15K237_fs_403
    - stage: valid
      split_name: test
      dataset: FB15K237_fs_403

If not, what should I do to test other few-shot settings?

Hi @karin0018, thank you for your interest! Yes, your config for inference is correct. However, you might also want to change the training n_way, like here and here. This parameter controls the maximum n_way the model sees during training. Direct inference on 40-way is technically possible, but you may want to train a model that has seen 40-way classification before inference.
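For instance, here is a rough sketch of what that could look like, reusing the keys from your own snippets (the FB15K237_fs_train40 entry name is hypothetical, and the exact lines the two links above point to may differ):

config/data_config.yaml

FB15K237_fs_train40:        # hypothetical entry name, for illustration only
  <<: *FB15K237_fs
  args:
    walk_length: null
    single_prompt_edge: True
    n_way: 40               # max n_way sampled for training episodes
    k_shot: 3
    base_construct: ConstructKG
    no_class_node: True
    remove_edge: True
  num_classes: 40

config/task_config.yaml

    - stage: train
      split_name: train
      dataset: FB15K237_fs_train40

The same kind of change applies to the arxiv_fs entries if you also want larger training episodes on ogbn-arxiv.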

Hope this helps. We will push a major update and some bug fixes along with our camera-ready version; stay tuned.

Thanks for your reply! I tried training with max n_way=20 on the arXiv dataset and max n_way=40 on FB15K-237. After training for 20 epochs, I got results like this:

valid_arxiv_fs_33/acc=0.312, 
valid_arxiv_fs_53/acc=0.204, 
valid_arxiv_fs_103/acc=0.106, 
valid_arxiv_fs_203/acc=0.065, 
valid_FB15K237_fs_53/acc=0.057, 
valid_FB15K237_fs_103/acc=0.024, 
valid_FB15K237_fs_203/acc=0.011, 
valid_FB15K237_fs_403/acc=0.007, 

Is that reasonable? Also, I attempted to use your original training settings with epoch=50, but the training speed was very slow. It took 2 days to run the program and I only obtained results for 20 epochs. QAQ

Hi @karin0018, we just identified a dumb bug in the few-shot scenario, which was fixed by this commit. Can you try pulling the repo again and training the model?

In case you are interested in the cause of the bug: for a node in the i-th class, only the i-th class node should be labeled as positive; however, in this implementation we accidentally labeled the i-th through the last class nodes as positive (see the commit for details). For example, in a 5-way task where the true class is the 3rd, the binary targets became [0, 0, 1, 1, 1] instead of [0, 0, 1, 0, 0]. Hence, you were getting near random-guess results. We just tried training with the updated code; you should be able to reproduce the paper's results.

Regarding the training time, do you mean end-to-end training or few-shot training? The end-to-end training script is expected to take roughly 2 days for 50 epochs on an NVIDIA A100. However, it is abnormal for the low-resource experiment to take that long; did you also use a 40-class scenario?

Again, sorry for the bug. We are serious about the reproducibility of our work and will make sure similar things don't happen again. Meanwhile, we have implemented a multi-GPU version that should be online in the next few days; we hope that will alleviate the training time issue.

Thanks for your detailed reply, I'll try again. ^-^.

About the training time: I ran the low-resource experiment using the given command python run_cdm.py --override lr_all_config.yaml on an NVIDIA A100 (40 GB). In addition, I changed batchsize=4 to batchsize=1 to alleviate an OOM problem; maybe that is why the running time is much slower than you said?

I see, that might be it. If you don't really care about graph tasks, you can remove chemblpre from the task list, which should speed up training a lot.
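In case it helps, a minimal sketch of what that could look like; the task_names key and the task list below are illustrative, so check lr_all_config.yaml for the exact spelling the repo uses:

lr_all_config.yaml

task_names:                 # illustrative key name; see the repo's lr_all_config.yaml
  - arxiv_fs
  - FB15K237_fs
  - WN18RR_fs
  # chemblpre removed here to skip the molecule pre-training task and speed up training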

Thanks for your advice, I tried again. ^-^
The results on the FB15K-237 dataset are impressively good, but on arXiv they are not as strong. Is that reasonable? (I removed chemblpre from the task to speed up training.)

wandb: Run summary:
wandb:                                       epoch 50
wandb:                    test_FB15K237_fs_103/acc 0.91
wandb:  test_FB15K237_fs_103/loss/dataloader_idx_5 0.14584
wandb:                    test_FB15K237_fs_203/acc 0.846
wandb:  test_FB15K237_fs_203/loss/dataloader_idx_6 0.08583
wandb:                    test_FB15K237_fs_403/acc 0.759
wandb:  test_FB15K237_fs_403/loss/dataloader_idx_7 0.05838
wandb:                     test_FB15K237_fs_53/acc 0.941
wandb:   test_FB15K237_fs_53/loss/dataloader_idx_4 0.2567
wandb:                       test_WN18RR_fs_51/acc 0.396
wandb:     test_WN18RR_fs_51/loss/dataloader_idx_8 0.8227
wandb:                       test_WN18RR_fs_53/acc 0.454
wandb:     test_WN18RR_fs_53/loss/dataloader_idx_9 0.67437
wandb:                       test_WN18RR_fs_55/acc 0.493
wandb:    test_WN18RR_fs_55/loss/dataloader_idx_10 0.63413
wandb:                       test_arxiv_fs_103/acc 0.332
wandb:     test_arxiv_fs_103/loss/dataloader_idx_2 0.47099
wandb:                       test_arxiv_fs_203/acc 0.215
wandb:     test_arxiv_fs_203/loss/dataloader_idx_3 0.25016
wandb:                        test_arxiv_fs_33/acc 0.633
wandb:      test_arxiv_fs_33/loss/dataloader_idx_0 1.51754
wandb:                        test_arxiv_fs_53/acc 0.502
wandb:      test_arxiv_fs_53/loss/dataloader_idx_1 0.93272

Hi @karin0018,

Sorry for the late reply. Interestingly, we are getting better results on arxiv and worse results on FB than yours, but I think the margins are reasonable. The most likely cause is that you used more ways for testing, which made the class split different, and consequently the results differ (few-shot experiments are sensitive to the class split).

We just pushed a new version; if you'd like, you can pull and try again.