Extract Keywords, Title and Description from Website headers

Question

Closed this issue 6 years ago · 6 comments

Hey guys, could you extract title, keywords and description from website headers and include them in the outputs? I think that would be a valuable addition to the great ARGUS tool and could be very useful for downstream analysis.

To clarify what I mean by title, keywords, and description, please see the two examples below with outputs from http://keyword-analyse.com/.

https://eshop.wuerth.de/de/DE/EUR/

https://www.microsoft.com/de-de/

Answer 1 · 2019-03-25T10:09:02.000Z

The information from the website header (title, description, keywords) has now been added to the scraped text of the main page, e.g.

"Title:Microsoft - Official Home Page;Description:At Microsoft our mission and values are to help people and businesses throughout the world realize their full potential.;Keywords:;Text:Jeden Tag eine neue Chance, mehr zu erreichen..."

You can find this version under the branch "header_info". Did you envision this enhancement differently or does it meet your expectations?
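For readers who want to see the idea in isolation, here is a minimal sketch of how the three header fields could be pulled out of a page and joined into the layout quoted above. It uses only Python's standard-library `html.parser` and is not the code on the "header_info" branch; the names `HeaderInfoParser`, `extract_header_info`, and `format_header_info` are hypothetical.

```python
from html.parser import HTMLParser


class HeaderInfoParser(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.info = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.info[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so accumulate it.
        if self._in_title:
            self.info["title"] += data


def extract_header_info(html):
    """Return a dict with the keys title, description, and keywords."""
    parser = HeaderInfoParser()
    parser.feed(html)
    return {key: value.strip() for key, value in parser.info.items()}


def format_header_info(info, text):
    # Mirrors the "Title:...;Description:...;Keywords:...;Text:..." layout
    # from the example string above.
    return "Title:{title};Description:{description};Keywords:{keywords};Text:{text}".format(
        text=text, **info
    )
```

Missing tags simply leave the corresponding field empty, which matches the empty "Keywords:" entry in the Microsoft example.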

Answer 2 · 2019-03-25T12:32:55.000Z

Thanks for the update!

I haven't tested the enhancement yet; however, from your example, the changes look very reasonable to me.
As a side note, I envisioned it such that each of the meta information fields gets its own column in the output file. That said, I think incorporating the meta information in the text column is just fine, as end users can easily extract the relevant passages into separate columns themselves without much ado.
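To illustrate the kind of extraction I have in mind, here is a rough sketch that splits the combined string from the example above back into separate fields with a regular expression. The function name `split_header_info` is made up, and the pattern assumes the fields always appear in the order Title, Description, Keywords, Text, which is only what the single example shows. A description that itself contains ";Keywords:" would be split in the wrong place.

```python
import re

# Non-greedy groups stop at the first following field marker; DOTALL lets
# the fields span newlines.
HEADER_PATTERN = re.compile(
    r"^Title:(?P<title>.*?);Description:(?P<description>.*?);"
    r"Keywords:(?P<keywords>.*?);Text:(?P<text>.*)$",
    re.DOTALL,
)


def split_header_info(combined):
    """Split a 'Title:...;Description:...;Keywords:...;Text:...' string into a dict."""
    match = HEADER_PATTERN.match(combined)
    if match is None:
        # Strings without the expected prefix are passed through as plain text.
        return {"title": "", "description": "", "keywords": "", "text": combined}
    return match.groupdict()
```

Each key of the returned dict can then be written to its own column.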

Answer 3 · 2019-03-25T12:56:30.000Z

We could also do it like David suggested and create a new column for this information.

Answer 4 · 2019-03-27T09:53:08.000Z

I updated the branch "header_info" based on your suggestion, and it works for the main pages. The output for the subpages is still erroneous, though, and I have not yet been able to figure out why.

Answer 5 · 2019-03-29T12:32:39.000Z

I updated the code. Header/meta info will now only be displayed in the row of the main page (the dl_rank=0 page). This reduces memory usage and also works with the "Aggregate Webpage Texts" function. There are now three new columns: title, description, keywords.
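As a rough illustration of the resulting row layout (not the actual code), the sketch below fills the three new columns only for the dl_rank=0 row and leaves them empty for subpages. The function `build_rows`, the `text` column name, and the input shapes are assumptions made for this example.

```python
META_COLUMNS = ("title", "description", "keywords")


def build_rows(pages, header_info):
    """Build output rows from (dl_rank, text) pairs.

    header_info is a dict with the keys title, description, and keywords;
    it is copied only into the main-page row (dl_rank == 0).
    """
    rows = []
    for dl_rank, text in pages:
        row = {"dl_rank": dl_rank, "text": text}
        for column in META_COLUMNS:
            row[column] = header_info[column] if dl_rank == 0 else ""
        rows.append(row)
    return rows
```

Storing the header fields once per website instead of once per subpage is what keeps the extra memory small.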

To be done:

There is still an issue with newline characters in the new columns. I will fix that on Monday.
We need to update the README/documentation, including new screenshots of the data structure.
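One possible way to handle the newline issue, sketched here as an assumption rather than the planned fix, is to collapse every run of whitespace in a header field into a single space before writing it to its column. The helper name `clean_field` is hypothetical.

```python
def clean_field(value):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() without arguments splits on any whitespace run,
    # including \n, \r, and \t, and drops leading/trailing whitespace.
    return " ".join(value.split())
```

For example, `clean_field("Official\r\nHome\n  Page\t")` returns `"Official Home Page"`, so each field stays on one line in the output file.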

Answer 6 · 2019-04-04T07:43:28.000Z

I will update this feature such that meta info is extracted from each (sub)webpage.