Tiiiger/bert_score

Different hug_trans versions cause different BERTScore

dzf0023 opened this issue · 6 comments

Dear Author,

I would like to ask why, for the same model and the same reference/prediction pair, different hug_trans versions produce different BERTScore values.

For instance, given prediction = "aaaaaa" and reference = "hello there", using the default model (roberta-large_L17_no-idf_version=0.3.12), hug_trans=4.30.2 gives a BERTScore_F1 of 0.80, while hug_trans=4.24.0 gives 0.238 for the same input.

Meanwhile, with identical prediction and reference strings as input, BERTScore_F1 under hug_trans=4.24.0 does not give 1 as the result.
Although the original paper mentions that a random BERTScore is computed for baseline rescaling, that random score should be very small, so it is hard to understand where such a large gap comes from.

Thank you so much for your contribution and for taking the time to answer our questions.

Hi @dzf0023,

I ran the following code with both transformers==4.30.2 and transformers==4.24.0 as you described.

from bert_score import score

pred = "aaaaaa"
ref = "hello there"

# raw BERTScore (P, R, F1) and the baseline-rescaled variant for the same pair
print(score([pred], [ref], lang='en'))
print(score([pred], [ref], lang='en', rescale_with_baseline=True))

However, I got identical outputs from the two versions:

no rescaling: (tensor([0.7637]), tensor([0.8599]), tensor([0.8089]))
with rescaling: (tensor([-0.4024]), tensor([0.1683]), tensor([-0.1321]))

Can you double-check that you are calling the function with the same inputs?

Hi @felixgwu,

Thank you so much for your quick response. I basically tried the metric from the Hugging Face website: https://huggingface.co/spaces/evaluate-metric/bertscore
Below is the code:

from evaluate import load

bertscore = load("bertscore")

pred = "aaaaaa"
ref = "hello there"

# predictions/references should be lists of strings
results = bertscore.compute(predictions=[pred], references=[ref], lang='en')
print(results)

It still gives me different results for different "hug_trans" versions.

hug_trans = 4.30.2 , result is (tensor([0.7637]), tensor([0.8599]), tensor([0.8089]))
hug_trans = 4.24.0  , result is (tensor([0.1998]), tensor([0.2959]), tensor([0.2385]))
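For reference, here is roughly how I collected the version and hash information from each environment, calling bert_score directly rather than through the evaluate wrapper (a minimal sketch; I am assuming score's return_hash flag and the packages' __version__ attributes here):

from bert_score import score
import bert_score
import transformers

pred = "aaaaaa"
ref = "hello there"

# library versions, so the two environments can be compared directly
print("transformers:", transformers.__version__)
print("bert_score:", bert_score.__version__)

# return_hash=True also returns the hash string encoding model, layer, and version info
(P, R, F1), hashcode = score([pred], [ref], lang='en', return_hash=True)
print(hashcode, float(F1))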

I see that you previously answered that BERTScore can change across Hugging Face transformers versions and that you would look into it. Is it because of that reason? (See https://github.com/Tiiiger/bert_score/issues/143#issuecomment-1327988420.)

Please find my screenshots of the two hashcodes below:

[screenshot: Colab]

[screenshot: server]

Unfortunately, I'm still not able to get 0.24 as you showed. Here is what I got with version 4.24.0; there might be a difference in some other library that I'm not aware of.

[screenshot: result with transformers 4.24.0]

Thank you so much for your time. May I know which library versions you are using? I can go and check the source code to see how the model is implemented.

Also, this is very interesting to me. Intuitively, "aaaaaa" vs. "hello there" is a very dissimilar pair, yet BERTScore still gives it a high score of 0.808, which I feel it should not. From your original paper, I understand most of the experiments are on MT, so I referred to another benchmark paper, "SummEval: Re-evaluating Summarization Evaluation" (https://arxiv.org/abs/2007.12626), since I focus more on the summarization task. In Table 4 of that paper, the average BERTScore of strong models such as T5 is only 0.4450. I am curious: if even large models like T5 only reach an average BERTScore of 0.445, how can a very dissimilar pair score 0.808? Please check another case I tried below:

[screenshot: another reference/generation pair]

This reference/generation pair is even more dissimilar, yet with 4.24.0 the BERTScore_F1 is still 0.77.

Thus, my feeling is that the version giving you the very high score tends to produce relatively high scores in general; that is, its lower bound is very high.

Another follow-up question: if the absolute value of BERTScore is not that important, should we care more about correlation with human judgments rather than the score itself? I assume different transformers versions produce different embeddings, which causes the variable results.

Please find the screenshot of the SummEval paper table below:

[screenshot: SummEval Table 4]

You can find all the libraries in https://github.com/Tiiiger/bert_score/blob/master/requirements.txt
But I think you may first take the environment that has transformers==4.30.2, run pip install transformers==4.24.0 to downgrade it, and check whether you also get 0.80 with the older version; then you can compare the other libraries across the two environments (see the sketch below).
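For example, a quick way to diff the two environments is to dump every installed package version and compare the outputs line by line (a minimal sketch using only the standard library, nothing specific to bert_score):

# list installed packages with versions; run this in both environments and diff the output
from importlib.metadata import distributions

versions = sorted((dist.metadata["Name"], dist.version) for dist in distributions())
for name, version in versions:
    print(f"{name}=={version}")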

0.80 is a reasonable BERTScore when using roberta-large_L17. As you can see in this file, the BERTScore between two random sentences is about 0.83. This is why we recommend using rescale_with_baseline=True, which gives you -0.1321. For a more detailed explanation of this, see our post.
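To make the arithmetic concrete, the rescaling is just a linear map of the raw score against that baseline (a sketch; the exact baseline comes from the baseline file for roberta-large_L17, and the value below is only an approximation back-solved from the numbers above):

# baseline rescaling: rescaled = (raw - baseline) / (1 - baseline)
baseline = 0.831   # approximate random-pair baseline for roberta-large_L17
raw_f1 = 0.8089    # the unrescaled F1 from above
rescaled = (raw_f1 - baseline) / (1 - baseline)
print(round(rescaled, 3))  # roughly -0.13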

Thank you so much for your suggestions! I will try them ASAP.