Rescale and specify a certain model
areejokaili opened this issue · 11 comments
Hi
Thank you for making your code available.
I have used your score before the last update (before multi-refs were possible and before the scorer was added). I used to keep the hash of the model to make sure I always got the same results.
With the new update, I'm struggling to figure out how to set a specific model and also rescale.
For example, I would like to do something like this:
out, hash_code = score(preds, golds, model_type="roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)", rescale_with_baseline=True, return_hash=True)
roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0) is the hash I got from my earlier runs a couple of months ago.
Appreciate your help
Areej
Hi @areejokaili, sorry for the confusion.
The code below should meet your use case.
out, hash_code = score(preds, golds, model_type="roberta-large", rescale_with_baseline=True, return_hash=True)
Hi @Tiiiger, thanks for the quick reply.
I tried the provided code, but it required lang='en':
scorer = BERTScorer(model_type='roberta-large', lang='en', rescale_with_baseline=True)
It works now, but I'm getting different scores than before. I was doing my own multi-refs scoring before, so maybe that's why.
I'll investigate more
Were you using baseline rescaling before? According to the hash, you were not.
This is what I used before:
score([p], [g], lang="en", verbose=False, rescale_with_baseline=True)
and this is the actual hash:
roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled
Cool, that looks correct. Let me know if you have any further questions.
Hi @Tiiiger again,
Sorry for asking again, but I ran a dummy test to compute the similarity between 'server' and 'cloud computing' in two different environments.
The first env has bert-score 0.3.0 and transformers 2.5.0, and I got the scores 0.379, 0.209, 0.289
hash --> roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled
The second env has bert-score 0.3.2 and transformers 2.8.0, and I got the scores -0.092, -0.167, -0.128
hash --> roberta-large_L17_no-idf_version=0.3.2(hug_trans=2.8.0)-rescaled
In both cases I used the following:
(P, R, F), hash_code = score(preds, golds, lang='en', rescale_with_baseline=True, return_hash=True)
I would like to use bert-score 0.3.2 for the multi-refs feature but would like to maintain the same scores as I got before.
I would appreciate any insight into why I'm not getting the same scores.
Hi @areejokaili, thank you for letting me know. I suspect there could be some bugs in the newer version, and I would love to fix them.
I am looking into this.
Hi, I quickly tried a couple of environments. Here are the results:
> score(['server'], ['cloud computing'],lang='en', rescale_with_baseline=True, return_hash=True)
((tensor([-0.0919]), tensor([-0.1670]), tensor([-0.1279])),
'roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.8.0)-rescaled')
> score(['server'], ['cloud computing'],lang='en', rescale_with_baseline=True, return_hash=True)
((tensor([0.3699]), tensor([0.2090]), tensor([0.2893])),
'roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled')
I believe this is due to an update in the RoBERTa tokenizer.
Running transformers==2.5.0, I got this warning:
RobertaTokenizerFast has an issue when working on mask language modeling where it introduces an extra encoded space before the mask token. See https://github.com/huggingface/transformers/pull/2778 for more information.
I encourage you to check out PR 2778 to understand this change.
So, as I understand it, this is not a change in our software. If you want to keep the same results as before, you should downgrade to transformers==2.5.0. However, I believe the behavior in transformers==2.8.0 is more correct. It's your call, and it really depends on your use case.
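If you decide to stay on 2.8.0, one way to catch this kind of drift early is to check the returned hash before comparing numbers with older runs. A rough sketch (the expected hash is just the one from your earlier logs):
# Sketch: guard against silent environment changes by checking the bert-score hash.
import transformers
from bert_score import score

EXPECTED_HASH = "roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)-rescaled"
print("transformers version:", transformers.__version__)

(P, R, F), hash_code = score(["server"], ["cloud computing"], lang="en",
                             rescale_with_baseline=True, return_hash=True)
if hash_code != EXPECTED_HASH:
    print("Hash changed to", hash_code, "- scores are not comparable to the old runs.")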
Again, thank you for giving me the heads-up. I'll add a warning to our README.
Hi @Tiiiger
Thanks for letting me know. I have updated both libraries and will go with Transformers 2.8.0.
I have one more question and would appreciate it if you could clarify what I'm missing here:
cands = ['I like lemons.']
refs = [['I am proud of you.', 'I love lemons.', 'Go go go.']]
(P, R, F), hash_code = score(cands, refs, lang="en", rescale_with_baseline=True, return_hash=True)
P, R, F = P.mean().item(), R.mean().item(), F.mean().item()
print(">", P, R, F)
print("manual F score:", (2 * P * R / (P + R)))
--- output ---
> 0.9023454785346985 0.9023522734642029 0.9025075435638428
manual F score: 0.9023488759866588
Do you know why the F score returned by the method is different from the one I compute manually?
Thanks again
Hi @areejokaili,
The reason is that you are using rescale_with_baseline=True.
The raw F score is computed from the raw P and R and then rescaled using the F baseline; P and R are rescaled independently using their own baselines.
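In other words, each of P, R, and F has its own baseline b and is rescaled roughly as (x - b) / (1 - b), so the harmonic-mean identity only holds for the raw scores. A small illustration with made-up baseline values (these are not the real roberta-large L17 baselines shipped with the package):
# Illustration only: the baseline numbers below are invented for the example.
def rescale(x, b):
    # bert-score style rescaling: maps the typical "baseline" score to 0
    return (x - b) / (1 - b)

p_raw, r_raw = 0.975, 0.976                  # hypothetical raw P and R
f_raw = 2 * p_raw * r_raw / (p_raw + r_raw)  # raw F is the true harmonic mean

b_p, b_r, b_f = 0.82, 0.80, 0.81             # hypothetical per-metric baselines
p, r, f = rescale(p_raw, b_p), rescale(r_raw, b_r), rescale(f_raw, b_f)

print("rescaled P, R, F:", p, r, f)
print("harmonic mean of rescaled P and R:", 2 * p * r / (p + r))
# The two F values differ: F is rescaled with its own baseline,
# not recomputed from the already-rescaled P and R.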
Thanks @felixgwu
Could you check this, please?
cands = ['I like lemons.', 'cloud computing']
refs = [['I am proud of you.', 'I love lemons.', 'Go go go.'],
        ['calculate this.', 'I love lemons.', 'Go go go.']]
print("number of cands and ref are", len(cands), len(refs))
(P, R, F), hash_code = score(cands, refs, lang="en", rescale_with_baseline=False, return_hash=True)
P, R, F = P.mean().item(), R.mean().item(), F.mean().item()
print(">", P, R, F)
print("manual F score:", (2 * P * R / (P + R)))
--- output ---
> 0.9152767062187195 0.9415446519851685 0.9280155897140503
manual F score: 0.9282248763666026
Appreciate the help,