/phpscraper-keyword-length-distribution-example

Example to demonstration the parsing of keywords as well as the simple analysis of the data to get a length distribution

Primary LanguagePHPGNU General Public License v3.0GPL-3.0

Keyword Length Distribution Example using PHPScraper

PHPScraper is a scraping library aimed at making web-scraping easier. It simplifies the coding effort involved by reducing verbosity.

This is an example of the library scraping keywords from the Wikipedia article "Online Advertising". After the keyword extraction, the data is processed to analyze the distribution of keyword length. The expected output can be found below.

Within PHPScraper, the library RAKE PHP Plus is used. RAKE stands for "Rapid Automatic Keyword Extraction" algorithm.

You can also check the PHPScraper keyword extraction example and Keyword Merge Package.

Installation

This example has been built on PHP 7.2.24 run on an Ubuntu-based Linux distro.

To run this example you will need to clone the repository and install the dependencies:

git clone git@github.com:spekulatius/phpscraper-keyword-scraping-example.git
composer install

If you would like to make changes you will need to fork the repository.

Execution

$ php keyword-length-distribution.php

Result

As graphic:

Keyword Length Distribution for "Online Marketing"

As text:

This page contains around 1989 keywords/phrases.
Below are some selected keyword extractions.

Length Distribution of Keywords:

Array
(
    [1] => 7
    [2] => 5
    [3] => 46
    [4] => 95
    [5] => 80
    [6] => 84
    [7] => 129
    [8] => 137
    [9] => 117
    [10] => 103
    [11] => 91
    [12] => 71
    [13] => 76
    [14] => 58
    [15] => 82
    [16] => 71
    [17] => 72
    [18] => 76
    [19] => 51
    [20] => 57
    [21] => 40
    [22] => 33
    [23] => 45
    [24] => 34
    [25] => 21
    [26] => 29
    [27] => 29
    [28] => 17
    [29] => 22
    [30] => 17
    [31] => 17
    [32] => 11
    [33] => 11
    [34] => 10
    [35] => 11
    [36] => 8
    [37] => 7
    [38] => 10
    [39] => 10
    [40] => 3
    [41] => 5
    [42] => 4
    [43] => 5
    [44] => 4
    [45] => 5
    [46] => 2
    [47] => 3
    [48] => 2
    [49] => 2
    [51] => 3
    [52] => 3
    [53] => 3
    [54] => 1
    [55] => 2
    [56] => 3
    [57] => 2
    [59] => 1
    [61] => 1
    [66] => 1
    [67] => 1
    [70] => 1
    [71] => 1
    [76] => 2
    [77] => 1
    [81] => 2
    [84] => 1
    [85] => 1
    [93] => 1
    [97] => 1
    [99] => 1
    [106] => 1
    [107] => 1
    [110] => 1
    [120] => 1
    [121] => 1
    [123] => 1
    [142] => 1
    [148] => 1
    [156] => 1
    [159] => 1
    [166] => 1
    [169] => 1
    [174] => 1
    [191] => 1
    [193] => 1
    [195] => 1
    [201] => 1
    [205] => 1
    [229] => 1
    [252] => 1
    [269] => 1
    [280] => 1
    [288] => 1
    [308] => 1
    [332] => 1
    [392] => 1
    [408] => 1
    [473] => 1
    [506] => 1
    [1422] => 1
    [2184] => 1
)

1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,51,52,53,54,55,56,57,59,61,66,67,70,71,76,77,81,84,85,93,97,99,106,107,110,120,121,123,142,148,156,159,166,169,174,191,193,195,201,205,229,252,269,280,288,308,332,392,408,473,506,1422,2184

7,5,46,95,80,84,129,137,117,103,91,71,76,58,82,71,72,76,51,57,40,33,45,34,21,29,29,17,22,17,17,11,11,10,11,8,7,10,10,3,5,4,5,4,5,2,3,2,2,3,3,3,1,2,3,2,1,1,1,1,1,1,2,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

Please note: These results might have changed by now.