Question about replicating the MAUVE scores
XiangLi1999 opened this issue · 3 comments
Hi,
Cool work! Could you point me to the code or the MAUVE hyper-parameter setup that you used to compute the MAUVE score? Using the default MAUVE hyper-parameters and the released simctg_contrasive.json yields a MAUVE score of 0.035, so I am trying to figure out whether I did something wrong...
Thanks!
Hi Lisa @XiangLi1999,
If I recall correctly, I used the following evaluation setup:
import mauve

out = mauve.compute_mauve(p_text=ref_list, q_text=pred_list, device_id=2,
                          max_text_length=256, verbose=False,
                          featurize_model_name='gpt2')
print(out.mauve)
where the ref_list is the list of human-written text and pred_list is the list of generated text from the model.
However, there is one key thing to note: every text in the ref_list should have the same length (after tokenization) as the text in the pred_list (i.e., 128 tokens in our case). That is, you should use the GPT-2 tokenizer to tokenize the reference text provided in the simctg_contrasive.json file and keep only the first 128 tokens, so that it can be compared with the 128-token text predicted by the model.
If the lengths of the reference text and the predicted text are not the same, the MAUVE score will be extremely low no matter which decoding method you use. I think the authors of MAUVE did not mention this explicitly in their work.
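To make this concrete, here is a minimal sketch of the truncation step. The JSON field names below ('reference_text' and 'generated_result') are placeholders, since I don't have the exact schema in front of me; adjust them to the actual keys in simctg_contrasive.json, and set device_id to whichever GPU you are using.

import json

import mauve
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

def truncate_to_prefix(text, num_tokens=128):
    # Tokenize with the GPT-2 tokenizer, keep only the first
    # num_tokens tokens, then decode back to a string.
    token_ids = tokenizer(text)['input_ids'][:num_tokens]
    return tokenizer.decode(token_ids)

with open('simctg_contrasive.json', 'r') as f:
    data = json.load(f)

# NOTE: placeholder field names; replace with the actual keys in the file.
ref_list = [truncate_to_prefix(item['reference_text']) for item in data]
pred_list = [item['generated_result'] for item in data]

out = mauve.compute_mauve(p_text=ref_list, q_text=pred_list, device_id=0,
                          max_text_length=256, verbose=False,
                          featurize_model_name='gpt2')
print(out.mauve)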
Also, if you are interested in replicating the rep-n and diversity scores, you can refer to [here].
Hope my reply helps! Please let me know if you have further questions :-)
Hi, it seems that the length of the reference continuation in simctg_contrastive.json is not always 128 tokens...
So could you kindly tell me how to fix it?