yooper/php-text-analysis

False IDF calculation

leik-software opened this issue · 2 comments

I think the IDF value in \TextAnalysis\Indexes\TfIdf::buildIndex is calculated incorrectly. With my example I get only zero values. As shown in this article https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/, the calculation in line 50 should be:
$value = 1+log(($count)/($value));
(add 1 to log())

@leik-software, would you be able to supply a test case that demonstrates the incorrect result?

Thank you,

My case has just one document, so without the added 1 the calculation looks like this:
$value = log(($count)/($value));
$value = log(1/1);
$value = log(1);
$value = 0;
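To make the single-document case above concrete, here is a minimal sketch comparing the two variants. The function `idf()` and its parameters are hypothetical names used for illustration only; this is not the code in TfIdf::buildIndex.

```php
<?php
// Hypothetical helper for illustration; not part of php-text-analysis.
// $docCount: total number of documents in the collection.
// $docFreq:  number of documents that contain the term.
// $addOne:   when true, use the 1 + log(N / df) variant from the linked article.
function idf(int $docCount, int $docFreq, bool $addOne = false): float
{
    $idf = log($docCount / $docFreq); // natural logarithm
    return $addOne ? 1 + $idf : $idf;
}

// One document that contains the term: N = 1, df = 1.
$plain    = idf(1, 1);       // log(1) = 0.0
$smoothed = idf(1, 1, true); // 1 + log(1) = 1.0
```

With the plain variant, every term in a one-document collection gets a weight of zero. The `1 +` variant keeps the weight positive.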

With this zero result, calculating the cosine similarity means dividing by zero, so adding 1 avoids that error. But this is just my case; I found examples both with and without the added 1. I'll close this issue again.
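For reference, the division by zero can also be avoided on the cosine similarity side, whichever IDF variant is used. The sketch below is an assumption-laden illustration: `cosineSimilarity()` is a hypothetical helper, not the library's implementation, and returning null for a zero-length vector is just one possible convention.

```php
<?php
// Hypothetical helper for illustration; not the library's implementation.
// Both arguments are numeric arrays sharing the same keys (one weight per term).
// Returns null when either vector has zero length, since the cosine is undefined there.
function cosineSimilarity(array $a, array $b): ?float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $key => $x) {
        $y = $b[$key] ?? 0.0; // a term missing from $b counts as weight 0
        $dot += $x * $y;
        $normA += $x * $x;
    }
    foreach ($b as $y) {
        $normB += $y * $y;
    }

    // Guard: an all-zero vector would make the denominator zero.
    if ($normA == 0.0 || $normB == 0.0) {
        return null;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

// All weights zero, as in the single-document case without the added 1.
$undefined = cosineSimilarity([0.0], [0.0]); // null instead of a division by zero
// Same document with the 1 + log() variant.
$same = cosineSimilarity([1.0], [1.0]);      // 1.0
```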