jacobswan1/Video2Commonsense

PPL indicator

Muccul opened this issue · 2 comments

```python
cmses = cms_list[random_id].split(';')[1:]
res[eval_id] = [cms]
gts[eval_id] = cmses
eval_id += 1
ppl_corpus = ''
for c in cmses:
    total_cms.add(c.lower())
    ppl_corpus += ' ' + c.lower()
tokens = nltk.word_tokenize(ppl_corpus)
unigram_model = unigram(tokens)
ppl_scores.append(perplexity(c.lower(), unigram_model))
# Compute PPL score
print('Perplexity score: ', sum(ppl_scores)/len(ppl_scores))
```
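For context, the snippet calls two helpers, `unigram` and `perplexity`, that are defined elsewhere in the evaluation script. A minimal sketch of what such unigram-perplexity helpers typically look like (an assumption for illustration; the repo's actual implementations may differ) is:

```python
import math
from collections import Counter

import nltk  # requires the 'punkt' tokenizer data for nltk.word_tokenize


def unigram(tokens):
    # Unigram model: relative frequency of each token in the corpus.
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: cnt / total for tok, cnt in counts.items()}


def perplexity(sentence, model, eps=1e-12):
    # Perplexity of a sentence under the unigram model:
    # exp of the average negative log-probability of its tokens.
    # Unseen tokens get a tiny probability `eps` to avoid log(0).
    tokens = nltk.word_tokenize(sentence)
    if not tokens:
        return 0.0
    log_prob = sum(math.log(model.get(tok, eps)) for tok in tokens)
    return math.exp(-log_prob / len(tokens))
```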

Hi @jacobswan1,
This PPL metric computes the perplexity of one of the ground-truth sentences (the loop variable `c`), not the perplexity of the predicted sentence (`cms`).
How does this code produce the value reported in the paper?
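For illustration, the fix being described would be to score the predicted sentence `cms` against the unigram model built from the ground-truth sentences, rather than scoring the last GT sentence `c`. A minimal sketch, reusing the variable names from the snippet and the helper sketches above (not the repo's updated code):

```python
# Build the unigram model from the ground-truth commonsense sentences...
ppl_corpus = ''
for c in cmses:
    total_cms.add(c.lower())
    ppl_corpus += ' ' + c.lower()
tokens = nltk.word_tokenize(ppl_corpus)
unigram_model = unigram(tokens)

# ...but evaluate perplexity on the model's *prediction*, not on a GT sentence.
ppl_scores.append(perplexity(cms.lower(), unigram_model))
```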

Hi @Muccul, thanks for pointing this error out. Indeed I just ran a quick test and found it does give the wrong PPL score.

I'll address it and update the arXiv table accordingly; I don't believe it invalidates the paper's other claims.

Please stay tuned; I'm working on getting the correct PPL scores and will update them soon.

P.S. Many thanks again for letting me know! And thanks also to Weijiang for forwarding this issue to me.

Hi @Muccul, I've updated the PPL evaluation code, along with the other re-implemented baseline numbers and the pre-trained checkpoint.
Thanks again for pointing out the bug, and for all your help!