commoncrawl/ia-web-commons

WAT extractor: Document title bug

Closed · 1 comment

There appears to be a bug in WAT file generation: the title of a document is taken from the last <title> tag in the document instead of the first. This causes wrong titles for documents that embed images inline, because those images sometimes carry <title> tags of their own.

For example, the title in the WAT record for https://www.foodnavigator.com/ shows up as "Linkedin" and not as "Food Ingredients & Food Science - Additives, Flavours, Starch". "Linkedin" is the last <title> on that page.

More examples with the same issue: https://louisville.edu/ https://www.ttu.edu/ https://www.lambeth.gov.uk/

Edit: On reflection, the <title> tag in the <head> section should be used, rather than simply the first <title> in the document.
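
To illustrate the difference, here is a minimal sketch in Python using only the standard library. It is not the extractor's code and not the fix itself; the class and function names (`TitleCollector`, `last_title`, `head_title`), the shortened sample page, and the use of an inline `<svg>` as the embedded image are all assumptions made for illustration. The fallback to the first <title> when <head> has none is also an assumption.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects every <title> element, noting whether it sits inside <head>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_head = False
        self.titles = []          # (inside_head, text) pairs in document order
        self._buf = None          # text pieces of the <title> currently open
        self._buf_in_head = False

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "title":
            self._buf = []
            self._buf_in_head = self.in_head

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title" and self._buf is not None:
            self.titles.append((self._buf_in_head, "".join(self._buf).strip()))
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def collect_titles(html):
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return parser.titles


def last_title(html):
    """The behaviour reported in this issue: the last <title> wins."""
    titles = collect_titles(html)
    return titles[-1][1] if titles else None


def head_title(html):
    """Prefer the <title> inside <head>; fall back to the first one (assumption)."""
    titles = collect_titles(html)
    for inside_head, text in titles:
        if inside_head:
            return text
    return titles[0][1] if titles else None


# Hypothetical, shortened page: a head title plus an inline image with its own title.
SAMPLE = """<html><head><title>Food Ingredients &amp; Food Science</title></head>
<body><svg><title>Linkedin</title></svg></body></html>"""

print(last_title(SAMPLE))
print(head_title(SAMPLE))
```

On this sample, `last_title` picks up the title of the inline image, while `head_title` returns the document title from <head>.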

Hi @robertwaksmunski, thanks again for reporting this issue. See #37 for the solution which is already in production. The WAT files of the Oct 2024 crawl (CC-MAIN-2024-42 - to be released in a few days) should contain the correct HTML document titles.