Find most common sequences of words (sentences) in a 10,000-word body?
lazharichir opened this issue · 1 comments
lazharichir commented
Hello,
Just wondering if you could help. I am working on a project where I need to go through a website's articles and figure out, for each article (2,000 to 10,000 words each), which phrases are the most common.
That way, I can improve the internal linking of this website.
Would this be something achievable using php-text-analysis?
Thank you,
L
yooper commented
Yes, you can do this with php-text-analysis. Here is how I would approach the problem.
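use TextAnalysis\Analysis\Keywords\Rake;
use TextAnalysis\Documents\TokensDocument;
use TextAnalysis\Filters\CharFilter;
use TextAnalysis\Filters\LowerCaseFilter;
use TextAnalysis\Filters\PunctuationFilter;
use TextAnalysis\Filters\SpacePunctuationFilter;
use TextAnalysis\Filters\StopWordsFilter;
use TextAnalysis\Tokenizers\GeneralTokenizer;
// pick your own stop words file; VENDOR_DIR is assumed to be a constant you define that points at Composer's vendor directory, with a trailing slash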
$stopwords = array_map('trim', file(VENDOR_DIR.'yooper/stop-words/data/stop-words_english_1_en.txt'));
$tokenizer = new GeneralTokenizer(" \n\t\r");
// put a space around punctuation so the tokenizer splits it from the adjacent word; this fixes issues at sentence boundaries
$spacePuncFilter = new SpacePunctuationFilter();
// this assumes your raw texts, with all markup stripped, are already loaded into the $texts array
$rakeResults = [];
foreach ($texts as $text) {
    $tokens = $tokenizer->tokenize($spacePuncFilter->transform($text));
    $tokensDoc = new TokensDocument($tokens);
    $tokensDoc->applyTransformation(new LowerCaseFilter())
        ->applyTransformation(new StopWordsFilter($stopwords), false)
        ->applyTransformation(new PunctuationFilter(), false)
        ->applyTransformation(new CharFilter(), false);
    // the 3 is the n-gram size, i.e. generate phrases of length 3
    $rake = new Rake($tokensDoc, 3);
    $rakeResults[] = $rake->getKeywordScores();
}
// $rakeResults holds one set of keyword phrase scores per article, so you can take the phrases from one article and look for them in the others.
// for example, $commonPhrases will hold the phrases that appear in both the 1st and 2nd articles
$commonPhrases = array_intersect(array_keys($rakeResults[0]), array_keys($rakeResults[1]));
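If you have more than two articles, one way to generalise the intersection above is to count how many articles each phrase appears in and keep the ones that are shared. Here is a minimal sketch in plain PHP (7.4+). Note that `phraseDocumentCounts` and `$minArticles` are hypothetical names, not part of php-text-analysis, the sample data is made up, and the sketch assumes each entry of `$rakeResults` is an array keyed by phrase, as the `array_keys()` calls above imply.

```php
<?php
// Hypothetical helper, NOT part of php-text-analysis.
// Assumes $rakeResults is a list of arrays shaped like [phrase => score].
function phraseDocumentCounts(array $rakeResults, int $minArticles = 2): array
{
    $counts = [];
    foreach ($rakeResults as $scores) {
        // array_keys() yields each phrase once per article,
        // so this counts articles, not occurrences
        foreach (array_keys($scores) as $phrase) {
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }
    // keep only the phrases found in at least $minArticles articles
    $counts = array_filter($counts, fn (int $n): bool => $n >= $minArticles);
    // most widely shared phrases first
    arsort($counts);
    return $counts;
}

// Made-up sample data standing in for real $rakeResults
$sample = [
    ['internal linking' => 4.0, 'site structure' => 3.5],
    ['internal linking' => 5.0, 'anchor text' => 2.0],
    ['internal linking' => 1.0, 'anchor text' => 4.0],
];
$shared = phraseDocumentCounts($sample);
// $shared is ['internal linking' => 3, 'anchor text' => 2]
```

Phrases shared by many articles are probably the better candidates for internal links. Raising `$minArticles` narrows the list to the most widely shared ones.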