commoncrawl/ia-web-commons

Complete HTML link extraction to cover all element attributes of type URI


The HTML specs provide a list of attributes, including the type required for each. All attributes of type URI should be covered by the ExtractingParseObserver when links are extracted and added as "Links" to the WAT file. See

Several attributes are missing, e.g., "cite" for <q> and <blockquote>, or embedded elements introduced with HTML5 (<video>, <audio>).

This issue precedes #7 and #8, and should include a unit test.
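
As a rough illustration of the kind of coverage meant here, the following is a minimal sketch of an element-to-attribute lookup table. It is not the ExtractingParseObserver code: the class name `UriAttributeTable` and the method `isUriAttribute` are hypothetical, and the table lists only a sample of the URI-typed attributes from the HTML specs, not the complete set.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; it is not part of ia-web-commons.
// It maps lower-case element names to the attributes whose value is a URI.
// The table is a partial sample and would need to be completed from the specs.
class UriAttributeTable {

    static final Map<String, Set<String>> URI_ATTRIBUTES = Map.ofEntries(
        // Long-standing link attributes
        Map.entry("a", Set.of("href")),
        Map.entry("area", Set.of("href")),
        Map.entry("link", Set.of("href")),
        Map.entry("img", Set.of("src", "longdesc")),
        Map.entry("script", Set.of("src")),
        Map.entry("iframe", Set.of("src")),
        Map.entry("form", Set.of("action")),
        // "cite" attributes named in this issue as missing
        Map.entry("q", Set.of("cite")),
        Map.entry("blockquote", Set.of("cite")),
        Map.entry("ins", Set.of("cite")),
        Map.entry("del", Set.of("cite")),
        // Embedded elements introduced with HTML5
        Map.entry("video", Set.of("src", "poster")),
        Map.entry("audio", Set.of("src")),
        Map.entry("source", Set.of("src")),
        Map.entry("track", Set.of("src")),
        Map.entry("embed", Set.of("src"))
    );

    // Returns true if the attribute of the given element holds a URI
    // according to the sample table above. Names are matched case-insensitively.
    static boolean isUriAttribute(String element, String attribute) {
        Set<String> attrs = URI_ATTRIBUTES.get(element.toLowerCase(Locale.ROOT));
        return attrs != null && attrs.contains(attribute.toLowerCase(Locale.ROOT));
    }
}
```

A unit test could then assert, for example, that `isUriAttribute("blockquote", "cite")` and `isUriAttribute("VIDEO", "poster")` are true while `isUriAttribute("p", "class")` is false. Keeping the table as data rather than branching code makes it easier to compare against the attribute lists in the specs.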