kanishkamisra/minicons

Very low log-probabilities (and therefore high surprisals) for a grammatical sentence

Closed this issue · 5 comments

Hi,

This issue is a question. I'm using the German GPT-2 model dbmdz/german-gpt2 to get log-probability and surprisal scores for each token.

The log-probabilities are quite low given that the sentence is grammatical German. Below are my code and a comparison between a German and an English sentence with the same model.

from minicons import scorer

gpt_model_scorer = scorer.IncrementalLMScorer("dbmdz/german-gpt2", "cpu")

log_probs = gpt_model_scorer.token_score(["Der Mensch sammelt die unterschiedlichsten Gegenstände."])
# results for German
# ('Der', 0.0)
# ('Mensch', -102.05498504638672)
# ('sammelt', -101.456787109375)
# ('die', -95.31419372558594)
# ('unterschiedlichsten', -98.86357116699219)
# ('Gegenstände', -89.14930725097656)
# ('.', -88.57086181640625)

log_probs_english = gpt_model_scorer.token_score(["The man collects various items."])
# results for English
# ('The', 0.0)
# ('man', -95.54633331298828)
# ('colle', -68.77188110351562)
# ('cts', -36.030174255371094)
# ('v', -87.14112854003906)
# ('ario', -44.695987701416016)
# ('us', -45.50498962402344)
# ('items', -79.1251449584961)
# ('.', -73.40552520751953)
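
For completeness, this is how I get the surprisals for the same sentence (a minimal sketch; I'm assuming token_score accepts a surprisal flag, and that surprisal is just the negated log-probability):

# Surprisals for the German sentence; since surprisal = -log p,
# these should be the positive counterparts of the values above.
surprisals = gpt_model_scorer.token_score(
    ["Der Mensch sammelt die unterschiedlichsten Gegenstände."],
    surprisal=True,
)
print(surprisals)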

Am I using your code in the intended way? Is the issue the GPT-2 model?

Hi - thanks for the reproducible code! You are indeed using the library correctly. Assuming minicons does not have a bug, this might be expected behavior. I checked by replacing the "die" before "unterschiedlichsten" with "der" (which I am assuming is the wrong gender, making the sentence ungrammatical), and the scores do become worse:

wrong_log_probs = gpt_model_scorer.token_score(["Der Mensch sammelt der unterschiedlichsten Gegenstände."])

wrong_log_probs

# [[('Der', 0.0),
#  ('Mensch', -101.90458679199219),
#  ('sammelt', -101.98616790771484),
#  ('der', -100.152099609375),
#  ('unterschiedlichsten', -102.6846694946289),
#  ('Gegenstände', -92.48149871826172),
#  ('.', -91.42720031738281)]]

which leads me to conclude it might just be a model thing.
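
Another quick comparison is the aggregate sentence score (a sketch; I'm assuming sequence_score's default reduction here, which averages the per-token log-probabilities):

# Mean per-token log-probability for the two versions; the
# grammatical "die" version should come out higher (less negative).
scores = gpt_model_scorer.sequence_score(
    ["Der Mensch sammelt die unterschiedlichsten Gegenstände.",
     "Der Mensch sammelt der unterschiedlichsten Gegenstände."]
)
print(scores)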

I am going to run more checks when I have the bandwidth, but let me know if this makes sense! Thanks for using minicons :)

Hi,
Thank you for replying so quickly.

Indeed "der" would be incorrect. But what is not expected is that the English sentence has higher log-probabilities, right?

Yes, that is not expected at all and is indeed very surprising. Was this model fine-tuned from an English corpus?

You can also do a sanity check by looking at the rank of each token (ranked by log-probability):

e.g., the 'die' stimulus yields much more favorable ranks for the subsequent words than the 'der' stimulus does:

gpt_model_scorer.token_score(["Der Mensch sammelt die unterschiedlichsten Gegenstände."], rank=True)

'''OUTPUT:
[[('Der', 0.0, 0),
  ('Mensch', -104.85952758789062, 25),
  ('sammelt', -104.52547454833984, 759),
  ('die', -98.28511047363281, 2),
  ('unterschiedlichsten', -101.32772827148438, 98),
  ('Gegenstände', -91.16206359863281, 9),
  ('.', -91.17868041992188, 4)]]
'''

gpt_model_scorer.token_score(["Der Mensch sammelt der unterschiedlichsten Gegenstände."], rank=True)

'''OUTPUT:
[[('Der', 0.0, 0),
  ('Mensch', -104.85952758789062, 25),
  ('sammelt', -104.52547454833984, 759),
  ('der', -102.77133178710938, 135),
  ('unterschiedlichsten', -105.02896881103516, 6482),
  ('Gegenstände', -94.78482818603516, 27),
  ('.', -93.73509216308594, 4)]]
'''

I will try to manually check without minicons soon, but cannot guarantee how soon :P
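
In case it helps, the check I have in mind would look roughly like this (a sketch using transformers directly, no minicons; assuming dbmdz/german-gpt2 loads as a standard causal LM):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
model = AutoModelForCausalLM.from_pretrained("dbmdz/german-gpt2")
model.eval()

inputs = tokenizer("Der Mensch sammelt die unterschiedlichsten Gegenstände.", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# log-softmax over the vocabulary turns logits into proper log-probabilities
log_probs = torch.log_softmax(logits, dim=-1)

# token i is scored by the distribution predicted at position i - 1
ids = inputs["input_ids"][0]
for i in range(1, len(ids)):
    print(tokenizer.decode(ids[i]), log_probs[0, i - 1, ids[i]].item())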

Hi @izaskr -- it seems like the reply to the issue in the model's repo explains the observed behavior? Would it be OK if I closed this issue then?

Closing for now -- feel free to reopen if you find minicons-specific issues!