yooper/php-text-analysis

How can I use this code to find text similarity?


mrmrn commented

Hi,
I am looking for a piece of code that simply measures the similarity between two comments. Each comment has 100-300 words.
How can I use this code for cosine similarity, or any other method of finding text similarity?
My texts are in Persian. Does that matter?

Thank you.

My approach would be to use cosine similarity. For improved results you may want to use a stemmer on each token, before passing the tokens into the cosine similarity call.

<?php
use TextAnalysis\Comparisons\CosineSimilarityComparison;
use TextAnalysis\Tokenizers\GeneralTokenizer;

// $comment1 and $comment2 hold the two comment strings to compare
$tokenizer = new GeneralTokenizer();
$text1 = $tokenizer->tokenize($comment1);
$text2 = $tokenizer->tokenize($comment2);
$compare = new CosineSimilarityComparison();
// returns a score from 1.0 to 0; 1.0 is an identical match
$result = $compare->similarity($text1, $text2);
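
For reference, the measure itself can be sketched in plain PHP. This is only an illustration of cosine similarity over token counts, not the library's implementation, and the function name cosineSimilarity is hypothetical:

```php
<?php
// Hypothetical helper: cosine similarity between two token lists,
// treating each list as a bag of words (token => count).
function cosineSimilarity(array $tokensA, array $tokensB): float
{
    $countsA = array_count_values($tokensA);
    $countsB = array_count_values($tokensB);

    // Dot product over the tokens the two lists share.
    $dot = 0;
    foreach ($countsA as $token => $count) {
        if (isset($countsB[$token])) {
            $dot += $count * $countsB[$token];
        }
    }

    // Euclidean norms of the two count vectors.
    $normA = sqrt(array_sum(array_map(fn($c) => $c * $c, $countsA)));
    $normB = sqrt(array_sum(array_map(fn($c) => $c * $c, $countsB)));

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // an empty list has nothing to compare
    }
    return $dot / ($normA * $normB);
}

// Two of the three tokens on each side are shared.
echo cosineSimilarity(['a', 'b', 'c'], ['a', 'b', 'd']), PHP_EOL;
```

Because the score depends only on which tokens match exactly, a stemmer or any other normalization that makes more tokens identical will raise it.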

Any feedback you have about the library would be helpful.

Thanks,

mrmrn commented

Wow, this is very helpful. I was using this code to measure similarity:

<?php
similar_text($var_1, $var_2, $percent);

echo 'similarity'.$percent;

:))
I will try your code and report the output.
Thank you for your fast response.

mrmrn commented

In my case, similar_text gives results much closer to the real similarity than this code does.
For example:

$var_1 = 'زندگی #شهریار از زبان هوشنگ #ابتهاج :)))



📡 ';
$var_2 = 'زندگی شهریار از زبان هوشنگ ابتهاج:))) 
حتما بخون


🐾به مستر پیشی بپیوندید:
';

similar_text says their similarity is 68.8995215311, but CosineSimilarityComparison says it is 40.824829046386.

Also, with these variables:

$var_1 = 'زندگی #شهریار از زبان هوشنگ #ابتهاج :)))';
$var_2 = 'زندگی شهریار از زبان هوشنگ ابتهاج:))) 
حتما بخون
';

similar_text says their similarity is 84.967320261438, but CosineSimilarityComparison says it is 53.452248382485.

And in this latter case they really are similar!
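
One way to see where the lower cosine score comes from is to compare the token lists directly. The sketch below splits on whitespace, which is an assumption (the library's tokenizer may split differently), and lists the tokens the two comments share. Tokens with attached punctuation, such as #شهریار and ابتهاج:))), do not match their plain counterparts, so only four tokens are common to both:

```php
<?php
$var_1 = 'زندگی #شهریار از زبان هوشنگ #ابتهاج :)))';
$var_2 = "زندگی شهریار از زبان هوشنگ ابتهاج:))) \nحتما بخون\n";

// Assumption: whitespace splitting stands in for the tokenizer.
$tokens1 = preg_split('/\s+/u', $var_1, -1, PREG_SPLIT_NO_EMPTY);
$tokens2 = preg_split('/\s+/u', $var_2, -1, PREG_SPLIT_NO_EMPTY);

// Tokens that appear, character for character, in both comments.
$shared = array_values(array_intersect($tokens1, $tokens2));
print_r($shared);

// With no repeated tokens, cosine similarity reduces to
// shared / sqrt(count1 * count2).
$score = count($shared) / sqrt(count($tokens1) * count($tokens2));
echo $score, PHP_EOL;
```

With 7 and 8 tokens and 4 shared, this gives 4 / sqrt(56), roughly 0.5345, which lines up with the 53.45 reported above.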

You will want to normalize your text by performing preprocessing. Try removing the punctuation, newlines, and stop words.

The code below removes punctuation:

<?php
use TextAnalysis\Comparisons\CosineSimilarityComparison;
use TextAnalysis\Tokenizers\GeneralTokenizer;
use TextAnalysis\Filters\PunctuationFilter;

$tokenizer = new GeneralTokenizer();
$tokens1 = $tokenizer->tokenize('زندگی #شهریار از زبان هوشنگ #ابتهاج :)))');
$tokens2 = $tokenizer->tokenize('زندگی شهریار از زبان هوشنگ ابتهاج:))) 
حتما بخون
');
$filter = new PunctuationFilter();

$filteredTokens1 = [];
foreach($tokens1 as $token)
{
    $filteredTokens1[] = $filter->transform($token);
}

$filteredTokens2 = [];
foreach($tokens2 as $token)
{
    $filteredTokens2[] = $filter->transform($token);
}

// afterwards filter out the empty or null values from the $filteredTokens array

$compare = new CosineSimilarityComparison();
// returns a score from 1.0 to 0; 1.0 is an identical match
$result = $compare->similarity($filteredTokens1, $filteredTokens2);
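
The "filter out the empty values" step, along with stripping symbols, could be sketched in plain PHP as follows. It uses PCRE Unicode classes rather than the library's filters, and normalizeTokens is a hypothetical helper name:

```php
<?php
// Hypothetical helper: strip punctuation and symbols from each token,
// then drop any token that ends up empty.
function normalizeTokens(array $tokens): array
{
    $cleaned = [];
    foreach ($tokens as $token) {
        // \p{P} matches punctuation, \p{S} matches symbols; /u enables UTF-8 mode.
        $token = preg_replace('/[\p{P}\p{S}]+/u', '', $token);
        if ($token !== null && $token !== '') {
            $cleaned[] = $token;
        }
    }
    return $cleaned;
}

// '#شهریار' loses its hash, 'ابتهاج:)))' loses the smiley, ':)))' is dropped.
print_r(normalizeTokens(['#شهریار', 'ابتهاج:)))', ':)))', 'زبان']));
```

Stop-word removal is not covered here; it would need a Persian stop-word list.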

mrmrn commented

Yes, after some text cleanup such as removing emojis, numbers, punctuation, and links, the two outputs are fairly close.
However, I don't know how the similar_text function works in PHP, but for now its results make more sense.
I will fork this code for the Persian language as soon as possible, and I think it will be very useful.
Thank you very much.
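
On how similar_text works: it compares the two strings byte by byte rather than word by word, and the percentage is twice the number of matching bytes divided by the combined length of both strings. A small illustration with ASCII input:

```php
<?php
// similar_text returns the number of matching characters and
// writes the percentage into the third argument.
$matching = similar_text('World', 'Word', $percent);

echo $matching, PHP_EOL; // 4 ("Wor" plus "d")
echo $percent, PHP_EOL;

// The percentage follows from the match count and the two lengths.
$expected = $matching * 2 / (strlen('World') + strlen('Word')) * 100;
echo $expected, PHP_EOL;
```

Because the comparison is per byte, a stray hashtag or smiley costs only a few bytes out of the whole string, whereas in a token-based cosine score it turns an entire word into a mismatch. Note also that Persian letters take two bytes each in UTF-8, so the counts refer to bytes, not letters.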