Adding MoverScore

Question

forrestbao opened this issue 2 years ago · 5 comments

Like BERTScore and BLEURT, MoverScore is another modern transformer-based, reference-based summarization metric.

However, we did not include it in our pilot study. Now may be a good time to add it.

Unfortunately, HF's evaluate library does not include it. But the original authors seem to have provided a good package: https://pypi.org/project/moverscore/ The GitHub source is here: https://github.com/AIPHES/emnlp19-moverscore

Let's add it. Note that, to keep the comparison fair and square (#10), we should use a RoBERTa-large based model. To select a model, see here. The model name is the one used on HuggingFace, so we can simply use RoBERTa-large (generally trained).

Answer 1 · 2022-12-09T12:50:17.000Z

Is this to be included in evalbase or DocAsRef?

Answer 2 · 2022-12-09T22:19:49.000Z

I think DocAsRef.
How did you run your experiments? I think you should define your metrics and then import EvalBase's top-level functions to evaluate them. Is that what you do?

Answer 3 · 2022-12-10T06:17:28.000Z

I think MoverScore should be added into DocAsRef. Any metrics developed or benchmarked by us for the ACL 2023 submission should go to DocAsRef.

To evaluate, just go to EvalBase/env.py, import the metrics from the DocAsRef folder, and add the imported metrics to the metrics dictionary. Then run EvalBase's experiment files, i.e., {summeval, realsumm, newsroom}.py.
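As a rough illustration of the "add it to the metrics dictionary" step, here is a minimal sketch. It does not reproduce the actual contents of EvalBase/env.py: the `(candidates, references) -> scores` signature, the `run_metrics` helper, and the `token_overlap` stand-in metric are all assumptions made up for this example. A MoverScore wrapper from DocAsRef would be registered the same way.

```python
from typing import Callable, Dict, List

# Assumed metric signature (hypothetical, not taken from EvalBase):
# a list of candidate summaries and a list of references in, one score per pair out.
MetricFn = Callable[[List[str], List[str]], List[float]]


def token_overlap(candidates: List[str], references: List[str]) -> List[float]:
    """Stand-in metric: fraction of reference tokens that appear in the candidate."""
    scores = []
    for cand, ref in zip(candidates, references):
        cand_tokens = set(cand.lower().split())
        ref_tokens = ref.lower().split()
        hits = sum(1 for tok in ref_tokens if tok in cand_tokens)
        scores.append(hits / len(ref_tokens) if ref_tokens else 0.0)
    return scores


# The metrics dictionary maps a display name to a scoring function.
metrics: Dict[str, MetricFn] = {}
metrics["token_overlap"] = token_overlap  # a MoverScore wrapper would go here too


def run_metrics(candidates: List[str], references: List[str],
                metrics: Dict[str, MetricFn]) -> Dict[str, List[float]]:
    """Hypothetical driver: apply every registered metric to the same inputs."""
    return {name: fn(candidates, references) for name, fn in metrics.items()}


if __name__ == "__main__":
    print(run_metrics(["the cat sat"], ["the cat slept"], metrics))
```

Keeping every metric behind the same call signature means the experiment scripts do not need to know which metric they are running.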

Answer 4 · 2022-12-10T13:26:11.000Z

moverscore_v2 runs on CPU. moverscore runs on GPU but does not allow changing the model.

Refs:
https://github.com/AIPHES/emnlp19-moverscore/blob/master/moverscore.py
https://github.com/AIPHES/emnlp19-moverscore/blob/master/moverscore_v2.py

Answer 5 · 2022-12-10T16:58:21.000Z

That's because MoverScore_v2 does not move its tensors to the GPU.
Compare the lines in MoverScore that have device=device with the corresponding lines in MoverScore_v2, which lack this kwarg.
Or search for cuda:0 or .to(device in the MoverScore code.
For example, the lines below in MoverScore

    padded, lens, mask = padding(arr, pad_token, dtype=torch.long)
    padded_idf, _, _ = padding(idf_weights, pad_token, dtype=torch.float)


    padded = padded.to(device=device)
    mask = mask.to(device=device)
    lens = lens.to(device=device)
    return padded, padded_idf, lens, mask, tokens

vs. the lines below in MoverScore_v2

    padded, lens, mask = padding(arr, pad_token, dtype=torch.long)
    padded_idf, _, _ = padding(idf_weights, pad_token, dtype=torch.float)

    return padded, padded_idf, lens, mask, tokens

Maybe you can send them a PR.
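A minimal sketch of what such a change could look like, assuming PyTorch. The `pad_batch` function is a simplified, hypothetical stand-in for the repository's `padding()` helper, and `collate_on_device` is an illustrative name, not a function from either file. The only point is the three `.to(device=device)` calls that the comparison above shows are missing.

```python
import torch

# Use the GPU when one is available, otherwise fall back to CPU.
DEFAULT_DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"


def pad_batch(arr, pad_token, dtype=torch.long):
    """Hypothetical stand-in for padding(): right-pad token-id lists to equal length."""
    lens = torch.tensor([len(a) for a in arr], dtype=torch.long)
    max_len = int(lens.max())
    padded = torch.full((len(arr), max_len), pad_token, dtype=dtype)
    mask = torch.zeros(len(arr), max_len, dtype=torch.long)
    for i, a in enumerate(arr):
        padded[i, :len(a)] = torch.tensor(a, dtype=dtype)
        mask[i, :len(a)] = 1  # 1 marks real tokens, 0 marks padding
    return padded, lens, mask


def collate_on_device(arr, pad_token, device=DEFAULT_DEVICE):
    """Pad a batch, then move the resulting tensors to the target device."""
    padded, lens, mask = pad_batch(arr, pad_token)
    # These moves are what keeps the inputs on the same device as the model.
    padded = padded.to(device=device)
    mask = mask.to(device=device)
    lens = lens.to(device=device)
    return padded, lens, mask


if __name__ == "__main__":
    padded, lens, mask = collate_on_device([[1, 2, 3], [4]], pad_token=0)
    print(padded.device, padded.tolist())
```

Moving the inputs only helps if the model itself is on the same device, so a real patch would probably also need a `model.to(device)` call.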