Question about replicating the MAUVE scores
XiangLi1999 opened this issue · 3 comments
Hi,
Cool work! Could you point me to the code or the MAUVE hyper-parameter setup that you used to compute the MAUVE score? Using the default MAUVE hyper-parameters and the released simctg_contrasive.json yields a MAUVE score of 0.035, so I am trying to figure out whether I did something wrong...
Thanks!
Hi Lisa @XiangLi1999,
If I recall correctly, I used the following evaluation setup:
import mauve

out = mauve.compute_mauve(p_text=ref_list, q_text=pred_list, device_id=2,
                          max_text_length=256, verbose=False,
                          featurize_model_name='gpt2')
print(out.mauve)
where the ref_list is the list of human-written text and pred_list is the list of generated text from the model.
However, there is one key thing to note: every text in the ref_list should have the same length (after tokenization) as the text in the pred_list (i.e., 128 tokens in our case). That is, you should use the GPT-2 tokenizer to tokenize the reference text provided in the simctg_contrasive.json file and keep only the first 128 tokens, so that it can be compared with the 128-token text predicted by the model.
If the lengths of the reference text and the predicted text are not the same, the MAUVE score will be extremely low no matter which decoding method you use. I think the authors of MAUVE did not mention this explicitly in their work.
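To make this concrete, here is a minimal sketch of the truncation step. The JSON field names below ('reference_text' and 'generated_result') are placeholders, since I don't have the exact schema in front of me; adjust them to the actual keys in simctg_contrasive.json, and set device_id to whichever GPU you are using.

import json

import mauve
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

def truncate_to_prefix(text, num_tokens=128):
    # Tokenize with the GPT-2 tokenizer, keep only the first
    # num_tokens tokens, then decode back to a string.
    token_ids = tokenizer(text)['input_ids'][:num_tokens]
    return tokenizer.decode(token_ids)

with open('simctg_contrasive.json', 'r') as f:
    data = json.load(f)

# NOTE: placeholder field names; replace with the actual keys in the file.
ref_list = [truncate_to_prefix(item['reference_text']) for item in data]
pred_list = [item['generated_result'] for item in data]

out = mauve.compute_mauve(p_text=ref_list, q_text=pred_list, device_id=0,
                          max_text_length=256, verbose=False,
                          featurize_model_name='gpt2')
print(out.mauve)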
Also, if you are interested in replicating the rep-n and diversity scores, you can refer to [here].
Hope my reply helps! Please let me know if you have further questions :-)
Hi, it seems that the length of the reference continuation in simctg_contrastive.json is not always 128 tokens...
So could you kindly tell me how to fix it?