kpu/kenlm

(Document Classification) How to get a document score with KenLM?

piegu opened this issue · 3 comments

piegu commented

Hi @kpu,

Many thanks for your work!
I would like to use KenLM for Document Classification (I have the text files of documents and the documents belong to 10 categories).

My idea is to train one KenLM model per category (10 models here) and then score a new document through the 10 KenLM models.
The "best" score would then give the category of the document.

My question is: what would this "document score" be, and how would I calculate it from the text file of a document?

kpu commented

Sentences are independent. You probably want perplexity per word https://en.wikipedia.org/wiki/Perplexity computed from the 10 models. Note that, especially if training data size is unbalanced, you will want to tune thresholds rather than simply take the lowest perplexity per word.
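Perplexity per word can be computed from per-token log10 probabilities (the quantities KenLM works with internally). A minimal pure-Python sketch of the formula, using made-up log-probability values rather than actual KenLM output:

```python
import math

def perplexity_per_word(log10_probs):
    """Perplexity per word from a list of log10 token probabilities,
    one entry per word plus one for the end-of-sentence token </s>."""
    total = sum(log10_probs)
    # perplexity = 10 ^ (- average log10 probability per token)
    return 10.0 ** (-total / len(log10_probs))

# hypothetical log10 probabilities for a 4-word sentence plus </s>
log10_probs = [-1.2, -0.8, -2.1, -1.5, -0.9]
print(perplexity_per_word(log10_probs))
```

This is the same quantity `kenlm.Model.perplexity` returns for a sentence: the total log10 score divided by the token count (words plus `</s>`), negated and exponentiated.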

piegu commented

You probably want perplexity per word computed from the 10 models

@kpu, thank you for your answer, but I don't know how to apply it in practice.
What KenLM code would you use?

The following one?

import numpy as np
import kenlm

m1 = kenlm.Model('kenlm_model1.arpa')
m2 = kenlm.Model('kenlm_model2.arpa')

# text with 2 sentences
text = 'My idea is to train by category a KenLM model (10, here) and then, score a new document through the 10 KenLM models. Thus, the "best" score will give the category of the document.'

ppl1 = m1.perplexity(text)
ppl2 = m2.perplexity(text)

if ppl1 <= ppl2:
    print("the text class is 1")
else:
    print("the text class is 2")

piegu commented

Hi,

Sentences are independent. You probably want perplexity per word https://en.wikipedia.org/wiki/Perplexity computed from the 10 models.

From this paragraph about perplexity in the Stanford document "N-gram Language Models" (see footer note), when there are more than 2 sentences, we just put </s> <s> between them and calculate the perplexity of the whole text (what @kpu calls perplexity per word in his post above).

Note: I found this blog post about "Can you compare perplexity across different segmentations?" interesting for understanding how the perplexity per word of a sentence is calculated.

Thus, my pseudo code becomes:

import numpy as np
import kenlm

m1 = kenlm.Model('kenlm_model1.arpa')
m2 = kenlm.Model('kenlm_model2.arpa')

# text with 2 sentences
text_original = 'My idea is to train by category a KenLM model (10, here) and then, score a new document through the 10 KenLM models. Thus, the "best" score will give the category of the document.'

text_modified = 'My idea is to train by category a KenLM model ( 10 , here ) and then, score a new document through the 10 KenLM models . </s> <s> Thus , the " best " score will give the category of the document .'

ppl1 = m1.perplexity(text_modified)
ppl2 = m2.perplexity(text_modified)

# without tuning thresholds
if ppl1 <= ppl2:
    print("the text class is 1")
else:
    print("the text class is 2")
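For the full 10-category setup, the pairwise comparison generalizes to an argmin over all models. A sketch of that logic, shown here with a stand-in model class so it is self-contained (in real use the dict values would be `kenlm.Model` instances, which expose the same `perplexity` method):

```python
def classify(text, models):
    """Return the category whose model gives the lowest perplexity."""
    return min(models, key=lambda category: models[category].perplexity(text))

# stand-in for kenlm.Model, for illustration only
class FakeModel:
    def __init__(self, ppl):
        self._ppl = ppl

    def perplexity(self, text):
        return self._ppl

models = {'sports': FakeModel(120.0), 'politics': FakeModel(85.0)}
print(classify('some document text', models))  # -> politics
```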

Note that, especially if training data size is unbalanced, you will want to tune thresholds rather than simply take the lowest perplexity per word.

@kpu: how do you tune thresholds for a multiclass classification problem?


Footer note on the paragraph about perplexity of the Stanford document "N-gram Language Models": "Since this sequence will cross many sentence boundaries, we need to include the begin- and end-sentence markers <s> and </s> in the probability computation. We also need to include the end-of-sentence marker (but not the beginning-of-sentence marker <s>) in the total count of word tokens N."
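One possible way to tune thresholds for the multiclass case (my own assumption, not something confirmed in this thread) is to learn a per-class offset on a held-out validation set, then pick the class minimizing offset-adjusted perplexity. A brute-force grid-search sketch:

```python
from itertools import product

def classify(ppls, offsets):
    """Pick the class with the lowest offset-adjusted perplexity."""
    return min(ppls, key=lambda c: ppls[c] - offsets[c])

def tune_offsets(val_data, classes, grid):
    """Grid-search per-class offsets maximizing validation accuracy.
    val_data: list of (per-class perplexity dict, true label) pairs."""
    best_offsets, best_acc = None, -1.0
    for combo in product(grid, repeat=len(classes)):
        offsets = dict(zip(classes, combo))
        acc = sum(classify(ppls, offsets) == label
                  for ppls, label in val_data) / len(val_data)
        if acc > best_acc:
            best_offsets, best_acc = offsets, acc
    return best_offsets

# toy validation set: class 'b' was trained on less data, so its
# perplexities run higher; an offset compensates for the imbalance
val_data = [
    ({'a': 100.0, 'b': 130.0}, 'b'),
    ({'a': 90.0, 'b': 120.0}, 'b'),
    ({'a': 80.0, 'b': 150.0}, 'a'),
]
offsets = tune_offsets(val_data, ['a', 'b'], [0.0, 20.0, 40.0])
```

With more classes a full grid search grows exponentially, so in practice one might instead fit a simple classifier (e.g. logistic regression) on the vector of per-class perplexities; the sketch above only illustrates the idea of compensating for unbalanced training data.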