Extract Keywords, Title and Description from Website headers
Closed this issue · 6 comments
Hey guys, could you extract title, keywords and description from website headers and include them in the outputs? I think that would be a valuable addition to the great ARGUS tool and could be very useful for downstream analysis.
To clarify what I mean by title, keywords and description, please see two examples below with outputs from http://keyword-analyse.com/.
The information from the website header (title, description, keywords) has now been added to the scraped text of the main page, e.g.
"Title:Microsoft - Official Home Page;Description:At Microsoft our mission and values are to help people and businesses throughout the world realize their full potential.;Keywords:;Text:Jeden Tag eine neue Chance, mehr zu erreichen..."
You can find this version under the branch "header_info". Did you envision this enhancement differently or does it meet your expectations?
Thanks for the update!
I haven't tested the enhancement yet, but judging from your example the changes look very reasonable to me.
As a side note, I envisioned each of the meta information fields getting its own column in the output file. That said, I think incorporating the meta information in the text column is just fine, as end users can easily extract the relevant passages into separate columns themselves.
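For readers who want to do that split themselves, here is a minimal sketch. It assumes the combined string always follows the "Title:...;Description:...;Keywords:...;Text:..." layout shown in the example above; the function name `split_header_fields` is hypothetical and not part of ARGUS.

```python
import re

# Assumed layout, taken from the example string in this thread:
# "Title:<t>;Description:<d>;Keywords:<k>;Text:<rest>"
_PATTERN = re.compile(
    r"^Title:(?P<title>.*?);Description:(?P<description>.*?);"
    r"Keywords:(?P<keywords>.*?);Text:(?P<text>.*)$",
    re.DOTALL,  # the scraped text may contain newlines
)


def split_header_fields(combined):
    """Split the combined text column into title, description, keywords, text."""
    match = _PATTERN.match(combined)
    if match is None:
        # No header prefix found: treat the whole string as plain text.
        return {"title": "", "description": "", "keywords": "", "text": combined}
    return match.groupdict()
```

Note that the non-greedy matching would misparse a title that itself contains ";Description:", so this is only a rough post-processing aid.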
We could also do it like David suggested and create a new column for this information.
I updated the branch "header_info" based on your suggestion and it works for the main pages. The output for the subpages is still erroneous, however, and I have not yet been able to track down the cause.
I updated the code. Header/meta info will now only be displayed in the row of the main page (the dl_rank=0 page). This reduces memory usage and also works with the "Aggregate Webpage Texts" function. There are now three new columns: title, description, keywords.
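To illustrate the resulting row layout, here is a small sketch, not the actual code on the branch. The column names title, description, keywords and dl_rank come from this thread; the `text` column name and the helper `build_row` are assumptions made for the example.

```python
META_COLUMNS = ("title", "description", "keywords")


def build_row(dl_rank, text, meta):
    """Build one output row; only the main page (dl_rank == 0) carries meta info."""
    row = {"dl_rank": dl_rank, "text": text}  # "text" column name is assumed
    for column in META_COLUMNS:
        # Subpages get empty strings so every row has the same columns.
        row[column] = meta.get(column, "") if dl_rank == 0 else ""
    return row
```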
To be done:
- There is still an issue with newline characters in the new columns. I will fix that on Monday.
- We need to update the readme/documentation including new screenshots of the data structure.
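One possible way to handle the newline issue from the first item is to collapse all whitespace in a cell before writing it out. This is only a sketch under that assumption; the helper name `clean_cell` is hypothetical and the fix on the branch may look different.

```python
import re


def clean_cell(value):
    """Collapse any run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", value).strip()
```

Applied to each of the title, description and keywords values, this keeps every record on a single line in a delimited output file.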
I will update this feature such that meta info is extracted from each (sub)webpage.
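As a rough illustration of per-page extraction, the sketch below pulls the three fields out of a page's HTML using only Python's standard library. The class `HeaderMetaParser` and the function `extract_header_meta` are hypothetical names; ARGUS may well rely on its scraping framework's selectors instead.

```python
from html.parser import HTMLParser


class HeaderMetaParser(HTMLParser):
    """Collect the <title> text and the description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Attribute values can be None; meta names are matched case-insensitively.
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.meta["title"] += data


def extract_header_meta(html):
    """Return a dict with title, description and keywords for one page."""
    parser = HeaderMetaParser()
    parser.feed(html)
    parser.close()
    parser.meta["title"] = parser.meta["title"].strip()
    return parser.meta
```

Calling `extract_header_meta` once per downloaded page, main page or subpage, would yield one set of values per row; missing tags simply come back as empty strings.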