Very low log-probabilities (and therefore high surprisals) for a grammatical sentence
Hi,
This issue is a question. I'm using the German GPT-2 model dbmdz/german-gpt2
to get log-probability and surprisal scores for each token.
The log-probabilities are quite low given that the sentence is a grammatical German sentence. Below is my code, along with a comparison between a German and an English sentence scored with the same model.
from minicons import scorer
gpt_model_scorer = scorer.IncrementalLMScorer("dbmdz/german-gpt2", "cpu")
log_probs = gpt_model_scorer.token_score(["Der Mensch sammelt die unterschiedlichsten Gegenstände."])
# results for German
# ('Der', 0.0)
# ('Mensch', -102.05498504638672)
# ('sammelt', -101.456787109375)
# ('die', -95.31419372558594)
# ('unterschiedlichsten', -98.86357116699219)
# ('Gegenstände', -89.14930725097656)
# ('.', -88.57086181640625)
log_probs_english = gpt_model_scorer.token_score(["The man collects various items."])
# results for English
# ('The', 0.0)
# ('man', -95.54633331298828)
# ('colle', -68.77188110351562)
# ('cts', -36.030174255371094)
# ('v', -87.14112854003906)
# ('ario', -44.695987701416016)
# ('us', -45.50498962402344)
# ('items', -79.1251449584961)
# ('.', -73.40552520751953)
Am I using your code in the intended way? Or is the issue with the GPT-2 model itself?
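For reference, here is a minimal sketch of how the scores above relate to surprisal and to raw probabilities. It assumes the values returned by token_score are natural-log probabilities; the helper names `surprisal_bits` and `probability` are made up for illustration and are not part of minicons.

```python
import math

def surprisal_bits(log_prob_nats: float) -> float:
    """Convert a natural-log probability into surprisal measured in bits."""
    return -log_prob_nats / math.log(2)

def probability(log_prob_nats: float) -> float:
    """Recover the raw probability from a natural-log probability."""
    return math.exp(log_prob_nats)

# Two scores copied from the German example above.
# Assumption: they are natural logs, not base-2 logs.
scores = [("Mensch", -102.05498504638672), ("die", -95.31419372558594)]

for token, log_prob in scores:
    print(token, round(surprisal_bits(log_prob), 1), probability(log_prob))
```

Under that assumption, a score near -100 corresponds to a probability many orders of magnitude below anything one would expect for a common word in a grammatical sentence, which is why the numbers look suspicious.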
Hi - thanks for the reproducible code! You indeed are using the library correctly. If we are to assume minicons does not have a bug, then this might be expected behavior. I checked by replacing the "die" before "unterschiedlichsten" with "der" (which I am assuming is the wrong gender, thereby making the sentence not so grammatical) and we see the scores become worse:
wrong_log_probs = gpt_model_scorer.token_score(["Der Mensch sammelt der unterschiedlichsten Gegenstände."])
wrong_log_probs
# [[('Der', 0.0),
# ('Mensch', -101.90458679199219),
# ('sammelt', -101.98616790771484),
# ('der', -100.152099609375),
# ('unterschiedlichsten', -102.6846694946289),
# ('Gegenstände', -92.48149871826172),
# ('.', -91.42720031738281)]]
which leads me to conclude it might just be a model thing.
I am going to run more checks when I have the bandwidth but let me know if this makes sense! Thanks for using minicons :)
Hi,
Thank you for replying so quickly.
Indeed "der" would be incorrect. But what is not expected is that the English sentence has higher log-probabilities, right?
Yes, that is totally not expected and is indeed very surprising. Was this model fine-tuned from an English corpus?
You can also do a sanity check by checking the ranks of each token (ranked based on log-probs):
e.g., the 'die' stimulus yields much more favorable ranks for the subsequent words than the 'der' stimulus does:
gpt_model_scorer.token_score(["Der Mensch sammelt die unterschiedlichsten Gegenstände."], rank=True)
'''OUTPUT:
[[('Der', 0.0, 0),
('Mensch', -104.85952758789062, 25),
('sammelt', -104.52547454833984, 759),
('die', -98.28511047363281, 2),
('unterschiedlichsten', -101.32772827148438, 98),
('Gegenstände', -91.16206359863281, 9),
('.', -91.17868041992188, 4)]]
'''
gpt_model_scorer.token_score(["Der Mensch sammelt der unterschiedlichsten Gegenstände."], rank=True)
'''OUTPUT:
[[('Der', 0.0, 0),
('Mensch', -104.85952758789062, 25),
('sammelt', -104.52547454833984, 759),
('der', -102.77133178710938, 135),
('unterschiedlichsten', -105.02896881103516, 6482),
('Gegenstände', -94.78482818603516, 27),
('.', -93.73509216308594, 4)]]
'''
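To make the rank check concrete, here is a rough sketch of how a rank can be derived from a vector of scores, plus a normalization check that may be worth running alongside it. Ranks do not change when a constant is added to every score, so sensible ranks alone cannot confirm that the absolute log-probabilities are right. The logits below are made up, the helper names are hypothetical, and the 1-based rank convention is an assumption rather than a description of minicons' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def rank_of(log_probs, token_id):
    """1-based rank: one plus the number of entries scoring strictly higher."""
    target = log_probs[token_id]
    return 1 + sum(1 for lp in log_probs if lp > target)

def is_normalized(log_probs, tol=1e-6):
    """True if the exponentiated scores sum to 1, as real log-probabilities must."""
    return abs(sum(math.exp(lp) for lp in log_probs) - 1.0) < tol

# Toy 5-word vocabulary with made-up logits (not taken from any real model).
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
toy_log_probs = log_softmax(toy_logits)

print(rank_of(toy_log_probs, 0))     # rank of token 0 among the 5 candidates
print(rank_of(toy_logits, 0))        # same rank when computed on the raw logits
print(is_normalized(toy_log_probs))  # properly normalized scores pass the check
print(is_normalized(toy_logits))     # raw logits do not
```

If a full distribution over the vocabulary at one position fails a check like `is_normalized`, the scores are probably not true log-probabilities, even when the ranks look reasonable.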
I will try to manually check without minicons soon, but cannot guarantee how soon :P
Hi @izaskr -- it seems like the reply to the issue in the model's repo explains the observed behavior? would it be ok if I closed this issue then?
closing for now -- feel free to reopen if you find minicons-specific issues!