Tiiiger/bert_score

Rescale and specify a certain model

areejokaili opened this issue · 11 comments

Hi
Thank you for making your code available.
I used your score before the last update (before multi-refs were possible and before the scorer object). I used to record the model hash to make sure I always got the same results.
With the new update, I'm struggling to figure out how to set a specific model and also rescale.

For example, I would like to do something like this:
out, hash_code = score(preds, golds, model_type="roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)", rescale_with_baseline=True, return_hash=True)

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0) is the hash I got from my earlier runs a couple of months ago.

Appreciate your help
Areej

Hi @areejokaili, sorry for the confusion.

The code below should meet your use case.

out, hash_code = score(preds, golds, model_type="roberta-large", rescale_with_baseline=True, return_hash=True)

That is, pass just roberta-large as the model_type rather than the full hash string.

Hi @Tiiiger, thanks for the quick reply.
I tried your provided code, but it required lang='en'.

scorer = BERTScorer(model_type='roberta-large', lang='en', rescale_with_baseline=True)

It works now, but I'm getting different scores than before. I was doing my own multi-refs scoring previously, so maybe that's why.
I'll investigate more.

Were you using baseline rescaling before? According to the hash, you were not?

This is what I used before:
score([p], [g], lang="en", verbose=False, rescale_with_baseline=True)
and this is the hash, actually:
roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled

Cool, that looks correct. Let me know if you have any further questions.

Hi @Tiiiger again,

Sorry for asking again, but I ran a dummy test computing the similarity between 'server' and 'cloud computing' in two different environments.

The first env has bert-score 0.3.0 and transformers 2.5.0, and got scores 0.379, 0.209, 0.289.
hash --> roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled

The second env has bert-score 0.3.2 and transformers 2.8.0, and got scores -0.092, -0.167, -0.128.
hash --> roberta-large_L17_no-idf_version=0.3.2(hug_trans=2.8.0)-rescaled
In both cases I used the following:
(P, R, F), hash_code = score(preds, golds, lang='en', rescale_with_baseline=True, return_hash=True)
I would like to use bert-score 0.3.2 for the multi-refs feature but would like to maintain the same scores as I got before.
I would appreciate any insight into why I'm not getting the same scores.

Hi @areejokaili, thank you for letting me know. I suspect there could be some bugs in the newer version, and I would love to fix those.

I am looking into this.

Hi, I quickly tried a couple of environments. Here are the results:

> score(['server'], ['cloud computing'],lang='en', rescale_with_baseline=True, return_hash=True)
((tensor([-0.0919]), tensor([-0.1670]), tensor([-0.1279])),
 'roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.8.0)-rescaled')
> score(['server'], ['cloud computing'],lang='en', rescale_with_baseline=True, return_hash=True)
((tensor([0.3699]), tensor([0.2090]), tensor([0.2893])),
 'roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled')

I believe this is due to an update in the RoBERTa tokenizer.

Running transformers==2.5.0, I got this warning:

RobertaTokenizerFast has an issue when working on mask language modeling where it introduces an extra encoded space before the mask token. See https://github.com/huggingface/transformers/pull/2778 for more information.

I encourage you to check out PR 2778 to understand this change.

So, as I understand it, this is not a change in our software. If you want to keep the same results as before, you should downgrade to transformers==2.5.0. However, I believe the behavior in transformers==2.8.0 is more correct. It's your call, and it really depends on your use case.
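As a side note, since the returned hash records both the bert-score and transformers versions, comparing it against the hash from an earlier run is a quick way to catch this kind of mismatch. A minimal sketch reusing the example above (the old hash string is the one reported earlier in this thread):

from bert_score import score

# Reusing the 'server' / 'cloud computing' example from above.
(P, R, F), hash_code = score(['server'], ['cloud computing'],
                             lang='en', rescale_with_baseline=True, return_hash=True)

# Hash reported earlier in this thread, under transformers 2.5.0.
old_hash = 'roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled'
if hash_code != old_hash:
    print('Different environment; scores may not be comparable:', hash_code)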

Again, thank you for giving me the heads-up. I'll add a warning to our README.

Hi @Tiiiger
Thanks for letting me know. I have updated both libraries and will go with Transformers 2.8.0.
I have one more question and would appreciate clarification on what I'm missing here.

cands=['I like lemons.']

refs = [['I am proud of you.','I love lemons.','Go go go.']]

(P, R, F), hash_code = score(cands, refs, lang="en", rescale_with_baseline=True, return_hash=True)
P, R, F = P.mean().item(), R.mean().item(), F.mean().item()

print(">", P, R, F)
print("manual F score:", (2 * P * R / (P + R)))

--- output ---

> 0.9023454785346985 0.9023522734642029 0.9025075435638428
manual F score: 0.9023488759866588

Do you know why the F score returned by the method is different from the one I compute manually?
Thanks again

Hi @areejokaili,

The reason is that you are using rescale_with_baseline=True.
The raw F score is computed from the raw P and R and then rescaled using the F baseline; P and R are each rescaled independently using their own baselines.
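To make this concrete, here is a small sketch of the arithmetic, using made-up raw scores and baseline values and assuming the linear rescaling (x - baseline) / (1 - baseline): because P, R, and F are each mapped through their own baseline, the harmonic mean of the rescaled P and R no longer reproduces the rescaled F.

# Made-up raw scores and baselines, purely to illustrate the arithmetic.
p_raw, r_raw = 0.95, 0.96
p_base, r_base, f_base = 0.83, 0.83, 0.83

# F is the harmonic mean of the *raw* P and R.
f_raw = 2 * p_raw * r_raw / (p_raw + r_raw)

def rescale(x, baseline):
    # Linear rescaling: the baseline maps to 0 and a perfect score stays 1.
    return (x - baseline) / (1 - baseline)

P, R, F = rescale(p_raw, p_base), rescale(r_raw, r_base), rescale(f_raw, f_base)

print(F)                      # rescaled F, as returned with rescale_with_baseline=True
print(2 * P * R / (P + R))    # harmonic mean of rescaled P and R: slightly different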

Thanks @felixgwu
Could you check this, please?

cands=['I like lemons.', 'cloud computing']
refs = [['I am proud of you.','I love lemons.','Go go go.'],
        ['calculate this.','I love lemons.','Go go go.']]
print("number of cands and ref are", len(cands), len(refs))
(P,R,F), hash_code = score(cands, refs, lang="en", rescale_with_baseline=False, return_hash=True)
P, R, F = P.mean().item(), R.mean().item(), F.mean().item()

print(">", P, R, F)
print("manual F score:", (2 * P * R / (P + R)))

--- output ---

> 0.9152767062187195 0.9415446519851685 0.9280155897140503
manual F score: 0.9282248763666026

Appreciate the help,