How can I use this code to find text similarity?

Question

Closed this issue 7 years ago · 5 comments

Hi,
I am looking for a piece of code that simply finds the similarity between two comments. Each comment has 100-300 words.
How can I use this code for cosine similarity, or any other method, to find text similarity?
My texts are in Persian. Does that matter?

Thank you.

Answer 1 · 2017-10-06T18:46:48.000Z

My approach would be to use cosine similarity. For improved results you may want to use a stemmer on each token, before passing the tokens into the cosine similarity call.

<?php
use TextAnalysis\Comparisons\CosineSimilarityComparison;
use TextAnalysis\Tokenizers\GeneralTokenizer;

// $comment1 and $comment2 hold the two comment strings to compare
$tokenizer = new GeneralTokenizer();
$text1 = $tokenizer->tokenize($comment1);
$text2 = $tokenizer->tokenize($comment2);
$compare = new CosineSimilarityComparison();
// returns a score from 1.0 to 0; 1.0 is an identical match
$result = $compare->similarity($text1, $text2);
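
For readers who want to see what a token-level cosine score measures, here is a minimal sketch in plain PHP. It is not the library's implementation; `cosineOfTokens` is a hypothetical helper that assumes each text is represented by its raw token counts.

```php
<?php
// Hypothetical helper, not part of the TextAnalysis library.
// Treats each token list as a vector of token counts and returns the cosine of the angle between them.
function cosineOfTokens(array $tokensA, array $tokensB): float
{
    $freqA = array_count_values($tokensA);
    $freqB = array_count_values($tokensB);

    // Dot product over the tokens that occur in both lists
    $dot = 0.0;
    foreach ($freqA as $token => $count) {
        if (isset($freqB[$token])) {
            $dot += $count * $freqB[$token];
        }
    }

    // Euclidean length of each count vector
    $square = function ($count) { return $count * $count; };
    $normA = sqrt(array_sum(array_map($square, $freqA)));
    $normB = sqrt(array_sum(array_map($square, $freqB)));

    // An empty token list has no direction, so report no similarity
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / ($normA * $normB);
}

$same = cosineOfTokens(['a', 'b'], ['a', 'b']);
$none = cosineOfTokens(['a', 'b'], ['c', 'd']);
$half = cosineOfTokens(['a', 'b', 'c', 'd'], ['a', 'b']);
printf("%.4f %.4f %.4f\n", $same, $none, $half);
```

Because the score depends only on which tokens match exactly, anything that changes a token (a prefix, attached punctuation, an inflected form) lowers it, which is why stemming and other preprocessing help.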

Any feedback you have about the library would be helpful.

Thanks,

Answer 2 · 2017-10-06T19:02:52.000Z

Wow, this is very helpful. I was using this code to find similarity:

<?php
similar_text($var_1, $var_2, $percent);

echo 'similarity'.$percent;

:))
I will try your code and report the output.
Thank you for your fast response.

Answer 3 · 2017-10-09T05:48:50.000Z

In my case, the similar_text results are much closer to the real similarity than the results from this code.
For example:

$var_1 = 'زندگی #شهریار از زبان هوشنگ #ابتهاج :)))



📡 ';
$var_2 = 'زندگی شهریار از زبان هوشنگ ابتهاج:))) 
حتما بخون


🐾به مستر پیشی بپیوندید:
';

similar_text says they have 68.8995215311 similarity, but CosineSimilarityComparison says they are 40.824829046386.

Also, with these variables:

$var_1 = 'زندگی #شهریار از زبان هوشنگ #ابتهاج :)))';
$var_2 = 'زندگی شهریار از زبان هوشنگ ابتهاج:))) 
حتما بخون
';

similar_text says they have 84.967320261438 similarity, but CosineSimilarityComparison says they are 53.452248382485.

And in this latter case they really are similar!
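
The gap can be traced by counting shared tokens. The sketch below assumes the tokenizer splits on whitespace only (`whitespaceTokens` is a hypothetical stand-in, not the library's GeneralTokenizer); under that assumption it reproduces the reported 53.45 score as a fraction, because '#شهریار' and 'شهریار' count as different tokens.

```php
<?php
// Assumption: tokens are produced by splitting on whitespace only.
// whitespaceTokens() is a hypothetical stand-in for the library tokenizer.
function whitespaceTokens(string $text): array
{
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$tokens1 = whitespaceTokens('زندگی #شهریار از زبان هوشنگ #ابتهاج :)))');
$tokens2 = whitespaceTokens("زندگی شهریار از زبان هوشنگ ابتهاج:))) \nحتما بخون\n");

// Tokens must match exactly to count as shared
$shared = array_values(array_intersect($tokens1, $tokens2));

// No token repeats in either text, so cosine reduces to shared / sqrt(count1 * count2)
$score = count($shared) / sqrt(count($tokens1) * count($tokens2));

printf("%d shared of %d and %d tokens, cosine = %.4f\n",
    count($shared), count($tokens1), count($tokens2), $score);
```

Only four of the seven and eight tokens match exactly, giving 4 / sqrt(7 * 8), about 0.5345. similar_text, by contrast, compares the strings byte by byte, so a stray '#' costs it very little.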

Answer 4 · 2017-10-09T12:42:24.000Z

You will want to normalize your text by performing preprocessing. Try removing the punctuation, newlines, and stop words.

The code below removes punctuation:

<?php
use TextAnalysis\Comparisons\CosineSimilarityComparison;
use TextAnalysis\Tokenizers\GeneralTokenizer;
use TextAnalysis\Filters\PunctuationFilter;

$tokenizer = new GeneralTokenizer();
$tokens1 = $tokenizer->tokenize('زندگی #شهریار از زبان هوشنگ #ابتهاج :)))');
$tokens2 = $tokenizer->tokenize('زندگی شهریار از زبان هوشنگ ابتهاج:))) 
حتما بخون
');
$filter = new PunctuationFilter();

$filteredTokens1 = [];
foreach ($tokens1 as $token)
{
    $filteredTokens1[] = $filter->transform($token);
}

$filteredTokens2 = [];
foreach($tokens2 as $token)
{
    $filteredTokens2[] = $filter->transform($token);
}

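// afterwards, filter out the empty or null values from the filtered token arrays
$isNotEmpty = function ($token) { return $token !== null && $token !== ''; };
$filteredTokens1 = array_values(array_filter($filteredTokens1, $isNotEmpty));
$filteredTokens2 = array_values(array_filter($filteredTokens2, $isNotEmpty));

$compare = new CosineSimilarityComparison();
// returns a score from 1.0 to 0; 1.0 is an identical match
$result = $compare->similarity($filteredTokens1, $filteredTokens2);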
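
If you prefer to normalize before tokenizing, the same idea can be sketched in plain PHP with Unicode-aware regular expressions. `normalizeTokens` is a hypothetical helper, independent of the library's filters; it assumes UTF-8 input and does not remove stop words.

```php
<?php
// Hypothetical helper; plain PHP, independent of the TextAnalysis filters.
// Assumes UTF-8 input. Stop words are not handled here.
function normalizeTokens(string $text): array
{
    // Drop links first so their letters do not survive as tokens
    $text = preg_replace('~https?://\S+~u', ' ', $text);

    // Keep letters, combining marks, whitespace, and the zero-width
    // non-joiner (U+200C) used inside Persian words; everything else
    // (punctuation, digits, emoji, symbols) becomes a space
    $text = preg_replace('/[^\p{L}\p{M}\x{200C}\s]+/u', ' ', $text);

    // Split on any run of whitespace, including newlines
    return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$tokens1 = normalizeTokens('زندگی #شهریار از زبان هوشنگ #ابتهاج :)))');
$tokens2 = normalizeTokens("زندگی شهریار از زبان هوشنگ ابتهاج:))) \nحتما بخون\n");

printf("%d and %d tokens after normalization\n", count($tokens1), count($tokens2));
```

With this cleanup all six tokens of the first comment also appear in the second, so a token-level cosine would rise to 6 / sqrt(6 * 8), roughly 0.87.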

Answer 5 · 2017-10-09T15:53:07.000Z

Yes, with some text cleanup such as removing emojis, numbers, punctuation, and links, the outputs are almost the same.
However, I don't know how the similar_text function works in PHP, but at present its results make more sense.
I will fork this code for the Persian language as soon as possible, and I think it will be very useful.
Thank you very much.
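
On how similar_text works: it repeatedly finds the longest common substring, recurses on the pieces to the left and right of it, and reports the percentage as twice the matched length divided by the combined length of both strings. It operates on bytes, not characters, so in UTF-8 each Persian letter counts as two bytes. A small sketch, assuming the source file is saved as UTF-8:

```php
<?php
// similar_text() returns the number of matching bytes and fills the
// percentage as matched * 2 / (strlen(a) + strlen(b)) * 100.
$sharedAscii = similar_text('World', 'Word', $percentAscii);

// 'با' and 'بر' share one of their two letters, but in UTF-8 each letter
// is two bytes and both second letters start with the same lead byte,
// so three of the four bytes match.
$sharedPersian = similar_text('با', 'بر', $percentPersian);

printf("ASCII: %d bytes matched, %.2f%%\n", $sharedAscii, $percentAscii);
printf("Persian: %d of %d bytes matched, %.2f%%\n",
    $sharedPersian, strlen('با'), $percentPersian);
```

This byte-level matching is part of why similar_text tolerates small differences such as an extra '#', and also why its percentages for Persian text can run higher than a per-letter comparison would give.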