yooper/php-text-analysis

Find most similar

it-is-hacker-time opened this issue · 2 comments

What algorithm should I use to find the closest match for a string in a set of strings?

Example of known inputs:

I would like a cheese pizza
I would like a cheese pizza with onions
I would like a cheese pizza without onions

Input I want to match against the known inputs to find the most similar one, if any is similar (in this example there are only spelling mistakes):

I would like a ceese pizza with out onnions.

I recommend using the cosine similarity algorithm.

$text = [];
$text[] = tokenize("I would like a cheese pizza");
$text[] = tokenize("I would like a cheese pizza with onions");
$text[] = tokenize("I would like a cheese pizza without onions");
$compareAgainst = tokenize("I would like a ceese pizza with out onnions.");
$bestScore = 0;
$bestIdx = 0;
$compare = new CosineSimilarityComparison();
foreach ($text as $index => $t) {
    $score = $compare->similarity($t, $compareAgainst);
    if ($score > $bestScore) {
        $bestScore = $score;
        $bestIdx = $index;
    }
}

// $text holds token arrays, so join the tokens back into a string for display
echo "best match " . implode(' ', $text[$bestIdx]);

The same code with some corrections:

`require_once('vendor/autoload.php');

use TextAnalysis\Comparisons\CosineSimilarityComparison;

$text = [];
$text[] = "I would like a cheese pizza";
$text[] = "I would like a cheese pizza with onions";
$text[] = "I would like a cheese pizza without onions";

$compareAgainst = tokenize("I would like a ceese pizza with out onnions.");

$best = 0;
$bestIdx = 0;
$compare = new CosineSimilarityComparison();

foreach ($text as $index => $t) {
    // Tokenize each candidate here so $text keeps the original strings for the echo below
    $tokens = tokenize($t);
    $score = $compare->similarity($tokens, $compareAgainst);
    if ($score > $best) {
        $best = $score;
        $bestIdx = $index;
    }
}

echo "best match {$text[$bestIdx]}";
`
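
For anyone wondering what a cosine comparison over tokens computes, here is a rough plain-PHP sketch that does not use the library. `simpleTokenize()` and `cosineSimilarity()` are hypothetical helpers written for this illustration, and the tokenizer is an assumption (lowercase, split on non-alphanumerics), so the scores need not match what `CosineSimilarityComparison` returns.

```php
<?php
// Illustrative only: cosine similarity over token counts in plain PHP.
// simpleTokenize() and cosineSimilarity() are hypothetical helpers,
// not part of php-text-analysis.

function simpleTokenize(string $text): array
{
    // Lowercase, then split on anything that is not a letter or digit
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

function cosineSimilarity(array $tokensA, array $tokensB): float
{
    // Turn each token list into a token => count vector
    $a = array_count_values($tokensA);
    $b = array_count_values($tokensB);

    // Dot product: only tokens present in both texts contribute
    $dot = 0;
    foreach ($a as $token => $count) {
        $dot += $count * ($b[$token] ?? 0);
    }

    // Euclidean length of each count vector
    $normA = sqrt(array_sum(array_map(fn($c) => $c * $c, $a)));
    $normB = sqrt(array_sum(array_map(fn($c) => $c * $c, $b)));

    return ($normA == 0.0 || $normB == 0.0) ? 0.0 : $dot / ($normA * $normB);
}

$known = [
    "I would like a cheese pizza",
    "I would like a cheese pizza with onions",
    "I would like a cheese pizza without onions",
];
$query = simpleTokenize("I would like a ceese pizza with out onnions.");

$scores = [];
foreach ($known as $i => $sentence) {
    $scores[$i] = cosineSimilarity(simpleTokenize($sentence), $query);
}

arsort($scores);                        // highest score first, keys preserved
$bestIdx = array_key_first($scores);
echo "best match: {$known[$bestIdx]}\n";
```

With this particular tokenizer, whole-word matching treats "ceese" and "onnions" as unknown tokens, and the split "with out" shares the token "with" with the second sentence. The sketch therefore ranks "with onions" above "without onions", so it is worth checking the scores on your own data before relying on the top match.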