Find most common sequences of words (sentences) in a 10,000-word body?

Question

lazharichir opened this issue 9 years ago · 1 comment

Hello,

Just wondering if you could help. I am working on a project where I need to go through a website's articles and, for each article (2,000 to 10,000 words each), figure out the most common phrases.

That way, I can improve the internal linking of this website.

Would this be something achievable using php-text-analysis?

Thank you,

L

Answer 1 · 2016-04-27T15:05:07.000Z

Yes, you can do this with php-text-analysis. Here is how I would approach the problem.

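// namespaces as laid out in the php-text-analysis source; verify against your installed version
use TextAnalysis\Analysis\Keywords\Rake;
use TextAnalysis\Documents\TokensDocument;
use TextAnalysis\Filters\CharFilter;
use TextAnalysis\Filters\LowerCaseFilter;
use TextAnalysis\Filters\PunctuationFilter;
use TextAnalysis\Filters\SpacePunctuationFilter;
use TextAnalysis\Filters\StopWordsFilter;
use TextAnalysis\Tokenizers\GeneralTokenizer;

// pick your own stop words file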
$stopwords = array_map('trim', file(VENDOR_DIR.'yooper/stop-words/data/stop-words_english_1_en.txt'));

$tokenizer = new GeneralTokenizer(" \n\t\r");

// pad punctuation with spaces so it becomes its own token; this fixes issues at sentence boundaries
$spacePuncFilter = new SpacePunctuationFilter();

// assumes $texts is an array holding the raw text of each article, with no markup
$rakeResults = [];
foreach($texts as $text)
{   
    $tokens = $tokenizer->tokenize($spacePuncFilter->transform($text));
    $tokensDoc = new TokensDocument($tokens);
    $tokensDoc->applyTransformation(new LowerCaseFilter())
        ->applyTransformation(new StopWordsFilter($stopwords), false)
        ->applyTransformation(new PunctuationFilter(), false)
        ->applyTransformation(new CharFilter(), false);
    $rake = new Rake($tokensDoc, 3); // 3 is the n-gram size, i.e. generate phrases of length 3
    $rakeResults[] = $rake->getKeywordScores();
}

// each entry of $rakeResults holds the scored keyword phrases for one article;
// take the top phrases from one article and then look for them in the other articles

// $commonPhrases will have the common phrases from the 1st and 2nd articles
$commonPhrases = array_intersect(array_keys($rakeResults[0]), array_keys($rakeResults[1]));
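
The array_intersect call above compares only the first two articles. For internal linking across a whole site, one possible extension is to count how many articles each phrase appears in. Below is a minimal sketch in plain PHP (7.4 or later), assuming each entry of $rakeResults maps phrase => score, as the array_keys() calls above imply. The function phraseDocumentFrequency and its $minArticles parameter are hypothetical names, not part of php-text-analysis, and the scores in the sample data are made up for illustration.

```php
<?php
// Hypothetical helper, not part of php-text-analysis.
// Assumes each entry of $rakeResults is an array of phrase => score.
function phraseDocumentFrequency(array $rakeResults, int $minArticles = 2): array
{
    $counts = [];
    foreach ($rakeResults as $scores) {
        // Phrases are array keys, so each one is counted at most once per article.
        foreach (array_keys($scores) as $phrase) {
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }

    // Keep only phrases found in at least $minArticles articles.
    $shared = array_filter($counts, fn (int $n): bool => $n >= $minArticles);

    // Most widely shared phrases first; keys (the phrases) are preserved.
    arsort($shared);

    return $shared;
}

// Made-up sample data standing in for real Rake output.
$rakeResults = [
    ['machine learning model' => 9.0, 'open source library' => 7.5],
    ['open source library' => 8.0, 'text analysis' => 4.0],
    ['open source library' => 6.0, 'machine learning model' => 5.5],
];

$shared = phraseDocumentFrequency($rakeResults);
print_r($shared);
```

With the sample data, $shared holds 'open source library' => 3 followed by 'machine learning model' => 2. Counting articles rather than summing scores keeps the result comparable between short and long articles, but summing the Rake scores would be a reasonable alternative if you want to weight phrases by importance.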