Aleph-Alpha/magma

Reproducing results from your paper

Golovneva opened this issue · 2 comments

Hi! Thank you for sharing the code for your model.
I'm having trouble reproducing the results published in your paper.
Here are the scores I get on the COCO dataset using the checkpoint you provided:

{'Bleu_1': 0.22440850959728406, 'Bleu_2': 0.11753228266783161, 'Bleu_3': 0.06043320902662557, 'Bleu_4': 0.0321128847993337, 'METEOR': 0.09099773362803487, 'ROUGE_L': 0.16770810280576667, 'CIDEr': 0.11203192991375235}

As you can see, they are significantly lower than the numbers reported in the paper. I'm computing them with the nlg-eval package, as you mentioned here.
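
To compute these numbers I call nlg-eval roughly like this (a sketch of my scoring step; hypotheses holds my generated captions and references the COCO ground-truth captions, transposed so that the i-th entry of each inner list belongs to image i, and I skip the embedding-based metrics):

from nlgeval import NLGEval

# hypotheses: list of generated captions, one per image
# references: list of reference lists; references[j][i] is the j-th
#             ground-truth caption for image i
scorer = NLGEval(no_skipthoughts=True, no_glove=True)
metrics = scorer.compute_metrics(references, hypotheses)
print(metrics)  # {'Bleu_1': ..., 'METEOR': ..., 'ROUGE_L': ..., 'CIDEr': ...}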

Which model in the paper does your checkpoint correspond to, base or long? How do you initialize it for evaluation? Here is my setup:

import os

from magma import Magma

# model_path is the local path to my clone of this repo
model = Magma.from_checkpoint(
    config_path=os.path.join(model_path, "configs/MAGMA_v1.yml"),
    checkpoint_path="mp_rank_00_model_states.pt",
    device="cuda:0",
)

As the prompt I'm using "A picture of " - is that correct?
I'm generating with temperature=0.7 and setting torch's manual seed to 42.
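
For completeness, I generate each caption roughly like this (a sketch of my loop; image_path stands for a COCO validation image, max_steps is just an illustrative cap, and the caption post-processing is omitted):

import torch
from magma.image_input import ImageInput

torch.manual_seed(42)

inputs = [
    ImageInput(image_path),  # a COCO validation image
    "A picture of ",         # my current prompt (with the trailing space)
]

# embed image + prompt, then sample a caption
embeddings = model.preprocess_inputs(inputs)
output = model.generate(
    embeddings=embeddings,
    max_steps=30,
    temperature=0.7,
    top_k=0,
)
caption = output[0]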

Is there anything I'm missing or doing wrong here? If everything looks fine, could you please share the evaluation scripts that reproduce the reported results?

Hi,

thanks for your interest in our work. Regarding your questions:

  • The checkpoint provided in this repo is only a demo checkpoint, although configuration-wise it is the same as the FF 4 adapter ablation we reported in the paper. I never evaluated this specific checkpoint. MAGMA base and long are trained on an entirely different dataset and are not really comparable.
  • For the prompt it is very important NOT to have a trailing whitespace, so it should be "A picture of" instead of "A picture of ".
  • A temperature of 0.7 is probably a bit high for evaluation; I don't recall exactly what we used, but I would not go higher than 0.1. Greedy decoding (temperature 0.0) should also work, I think. See the sketch below the list.
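
For illustration, reusing your setup from above, the generation call I would try looks roughly like this (just a sketch of the suggested settings, not our evaluation code):

inputs = [
    ImageInput(image_path),
    "A picture of",  # note: no trailing whitespace
]

embeddings = model.preprocess_inputs(inputs)
output = model.generate(
    embeddings=embeddings,
    max_steps=30,
    temperature=0.1,  # low temperature; greedy decoding (temperature 0.0) may also be worth trying
    top_k=0,
)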

Please understand that I cannot share further resources, although using nlg-eval should be rather straightforward. With a lower temperature and without the trailing whitespace in the prompt, I would expect the results to improve.

Best,

Constantin

Thank you! Changing the prompt and the temperature helped to improve the scores, although they are still lower than those reported in the paper:
{'Bleu_1': 0.39627885456680473, 'Bleu_2': 0.2550440831428488, 'Bleu_3': 0.16363634146091482, 'Bleu_4': 0.10697364143823233, 'METEOR': 0.1581395520589721, 'ROUGE_L': 0.3094927895838829, 'CIDEr': 0.3734865136851766}