exclude header & footer

Question

Thanks for your awesome library.
How can I exclude headers and footers (for every website)?

Answer 1 · 2023-12-19T09:14:43.000Z

This is a non-trivial task that requires specialized tools rather than an HTML-to-text conversion library.

You could do one of the following:

- Clean up the obtained text representation, which is easy if the headers/footers stay constant.
- Apply boilerplate removal techniques, as described in the following paper:
  Lang, Heinz-Peter, Wohlgenannt, Gerhard and Weichselbraun, Albert. (2012). “TextSweeper - A System for Content Extraction and Overview Page Detection”. International Conference on Information Resources Management (Conf-IRM), Vienna, Austria; http://eprints.weblyzard.com/55/1/lang2012-textSweeper.pdf
- For more complex use cases such as Web forums, use content extraction techniques such as HARVEST:
  Weichselbraun, Albert, Brasoveanu, Adrian M. P., Waldvogel, Roger and Odoni, Fabian. (2020). “Harvest - An Open Source Toolkit for Extracting Posts and Post Metadata from Web Forums”. 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Melbourne, Australia; https://arxiv.org/pdf/2102.02240
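
To illustrate the first option (cleaning up the text when headers/footers stay constant), here is a minimal sketch. It assumes you have already converted several pages of the same site to plain text, and that the header and footer appear as identical leading and trailing lines on every page. The function name `strip_constant_header_footer` is hypothetical and not part of any library; it is plain Python and will not handle headers that vary between pages.

```python
from typing import List


def strip_constant_header_footer(pages: List[str]) -> List[str]:
    """Drop the leading and trailing lines that are identical on every page.

    Hypothetical helper: assumes all pages come from the same site and
    share a line-for-line identical header and footer.
    """
    split = [page.splitlines() for page in pages]
    if len(split) < 2:
        return pages  # nothing to compare against

    shortest = min(len(lines) for lines in split)

    # Count leading lines that are the same on all pages.
    head = 0
    while head < shortest and len({lines[head] for lines in split}) == 1:
        head += 1

    # Count trailing lines that are the same on all pages,
    # without overlapping the header already found.
    tail = 0
    while tail < shortest - head and len({lines[-1 - tail] for lines in split}) == 1:
        tail += 1

    return ["\n".join(lines[head:len(lines) - tail]) for lines in split]


pages = [
    "Site Name\nHome | About\nArticle one text\n(c) Example",
    "Site Name\nHome | About\nAnother article\nwith two lines\n(c) Example",
]
cleaned = strip_constant_header_footer(pages)
```

With at least two pages per site as input, only the lines that differ between pages are kept. For sites whose headers or footers change from page to page, this approach is likely too simple, and the boilerplate removal techniques cited above are the better fit.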