scrapinghub/webstruct

Not possible to annotate <button>, <option> etc. elements

david-macleod opened this issue · 0 comments

Using the WebAnnotator Firefox extension, it is not possible to annotate text that is a descendant of certain (interactive) HTML elements. I have noticed this with <button> and <option> so far, but there may be others.

This can lead to (apparent) false positives when predictions are made on text belonging to these elements: the text cannot be labelled, so a correct prediction there is still scored as wrong. This affects both the model and the resulting metrics.

First, I wanted to confirm that it is indeed impossible to add annotations to these elements. If so, I have two questions:

  1. What is the cleanest way to remove these tags?
  2. Do you think this should become a webstruct default?

Within the HtmlTokenizer constructor we have a few options available but none seem suitable for this task.

  • ignore_html_tags will ignore the element and its children, but will also remove any tail text. For example, given

<html><body>start<option>hello</option>end</body></html>

    the text "end" will be lost.

  • kill_html_tags will drop the element and its children while preserving tail text, but only with keep_child=False, and that parameter is not exposed at the class level.
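
To make the desired behaviour concrete, here is a minimal sketch of "drop the element and its children, keep the tail text", written against the standard library's xml.etree.ElementTree so that it runs on its own. The helper name drop_keeping_tail is hypothetical and is not part of webstruct. It also assumes well-formed markup, which real pages often are not; on trees parsed with lxml.html, the element method drop_tree() should give the same tail-preserving result.

```python
import xml.etree.ElementTree as ET


def drop_keeping_tail(root, tags):
    """Remove every element whose tag is in `tags` (children included),
    re-attaching its tail text to the preceding node.

    Hypothetical helper for illustration; not a webstruct function.
    """
    tags = set(tags)
    # ElementTree elements have no parent pointer, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Iterate over a snapshot because the tree is mutated in the loop.
    for el in list(root.iter()):
        parent = parent_of.get(el)
        if el.tag not in tags or parent is None:
            continue  # not a target, or the root itself
        index = list(parent).index(el)
        if el.tail:
            if index == 0:
                # No previous sibling: the tail joins the parent's text.
                parent.text = (parent.text or "") + el.tail
            else:
                # Otherwise it joins the previous sibling's tail.
                prev = parent[index - 1]
                prev.tail = (prev.tail or "") + el.tail
        parent.remove(el)
    return root


doc = "<html><body>start<option>hello</option>end</body></html>"
root = ET.fromstring(doc)
drop_keeping_tail(root, {"option", "button"})
cleaned = ET.tostring(root, encoding="unicode")
print(cleaned)  # <html><body>startend</body></html>
```

Note that "start" and "end" end up adjacent with no whitespace between them, so a real implementation may want to insert a separator before tokenization.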

I would be happy to create a PR/tests for this if required.