weblyzard/inscriptis

exclude header & footer

Issue closed · 1 comment

Thanks for your awesome library.
How can I exclude headers and footers (for every website)?

This is a non-trivial task that requires specialized tools rather than an HTML-to-text conversion library.

You could either

  1. clean up the obtained text representation (which is easy if the headers/footers stay constant).
  2. apply techniques such as boilerplate removal, as described in the following paper:
    Lang, Heinz-Peter, Wohlgenannt, Gerhard and Weichselbraun, Albert. (2012). “TextSweeper - A System for Content Extraction and Overview Page Detection”. International Conference on Information Resources Management (Conf-IRM), Vienna, Austria; http://eprints.weblyzard.com/55/1/lang2012-textSweeper.pdf
  3. for more complex use cases such as Web forums, use content extraction techniques such as HARVEST:
    Weichselbraun, Albert, Brasoveanu, Adrian M. P., Waldvogel, Roger and Odoni, Fabian. (2020). “Harvest - An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums”. 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Melbourne, Australia; https://arxiv.org/pdf/2102.02240