hplt-project/warc2text-runner

Tag filters moved from warc2text

nvanva opened this issue · 4 comments

For the previous HPLT data release we used these tag filters: https://github.com/hplt-project/warc2text-runner/blob/main/mt-filter-list.annotated copied from https://github.com/paracrawl/cirrus-scripts. While reimplementing them in Python, I noticed some potential issues to discuss.

If I understood the code correctly, a document was filtered out if it contained any of the specified (tag, attribute) pairs with a value that has a substring matching the specified regexp (https://github.com/bitextor/warc2text/blob/d066592685c17f5efa2624029e6206f5a74db63f/src/html.cc#L20 employs std::regex_search).
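To make that reading concrete, here is a minimal Python sketch of the semantics as I understand them. It is not the warc2text code: the `FILTERS` list, the `is_filtered` name and the `(tag, {attribute: value})` input shape are assumptions made for illustration, and `re.search` stands in for std::regex_search.

```python
import re

# Hypothetical in-memory form of the filter list: (tag, attribute, regexp).
# Only the doGTranslate filter quoted in this issue is included.
FILTERS = [
    ("a", "onclick", re.compile(r"doGTranslate\(\'.{2}\|.{2}\'\)")),
]


def is_filtered(elements, filters=FILTERS):
    """Return True if any element triggers a filter.

    `elements` is an iterable of (tag, {attribute: value}) pairs, which is an
    assumed representation of the parsed HTML, not the warc2text one.
    """
    for tag, attrs in elements:
        for f_tag, f_attr, pattern in filters:
            # Tag and attribute names are compared exactly (case-sensitively).
            if tag != f_tag or f_attr not in attrs:
                continue
            # re.search looks for the pattern anywhere in the value,
            # i.e. a substring match rather than a full match.
            if pattern.search(attrs[f_attr]):
                return True
    return False
```

With this sketch, `is_filtered([("a", {"onclick": "doGTranslate('en|fr'); return false;"})])` is true, because the pattern only has to match somewhere inside the attribute value.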

  1. Attribute names are lowercased before being compared with the filters, but tag names are not. Value matching seems to be case-sensitive as well (https://github.com/bitextor/warc2text/blob/d066592685c17f5efa2624029e6206f5a74db63f/src/util.cc#L125). Would it be reasonable to make the comparison of tag names, attribute names and values all case-insensitive?

  2. This filter:

    a onclick doGTranslate\(\'.{2}\|.{2}\'\)

    will not match calls of doGTranslate with language codes that are not two letters long (one example found on the Web, from https://gtranslate.io/forum/callback-dogtranslate-function-t3989.html: <a href="#" onclick="doGTranslate('en|zh-CN'); return false;"). Should we reconsider this filter?
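Both points can be checked quickly in Python. This is only a sketch: `value_matches` and `same_name` are hypothetical helpers, and Python's `re.IGNORECASE` is assumed to be an acceptable stand-in for whatever case-insensitive mode the real implementation would use.

```python
import re

# The filter regexp exactly as it appears in the filter list.
ORIGINAL = r"doGTranslate\(\'.{2}\|.{2}\'\)"


def value_matches(value, pattern=ORIGINAL, ignore_case=False):
    """Substring-style regexp match on an attribute value."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, value, flags) is not None


def same_name(a, b):
    """Case-insensitive comparison for tag and attribute names."""
    return a.lower() == b.lower()


# Point 1: case-sensitive matching misses differently cased values and names.
assert not value_matches("DOGTRANSLATE('en|fr')")
assert value_matches("DOGTRANSLATE('en|fr')", ignore_case=True)
assert same_name("A", "a")

# Point 2: the pattern accepts only two-character language codes.
assert value_matches("doGTranslate('en|fr'); return false;")
assert not value_matches("doGTranslate('en|zh-CN'); return false;")
```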

I think it is a good idea to relax the matching criteria to match a broader set of cases in both scenarios. Maybe for 2. something like doGTranslate\(\'.{2,6}\|.{2,6}\'\)?

What would you say about just `doGTranslate(`? It is much simpler and can be checked even without regular expressions. It includes everything that matches the current version, so it should not give false negatives. Can it result in any false positives?
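A small Python comparison of the three candidates, as a sketch only. The sample values are illustrative: the first two follow the examples in this thread, while the `this.value` call and the `window.print()` value are hypothetical inputs I made up to probe the difference.

```python
import re

ORIGINAL = re.compile(r"doGTranslate\(\'.{2}\|.{2}\'\)")      # current filter
RELAXED = re.compile(r"doGTranslate\(\'.{2,6}\|.{2,6}\'\)")   # 2 to 6 characters per code
SUBSTRING = "doGTranslate("                                   # plain substring, no regexp


def verdicts(value):
    """Return (original, relaxed, substring) match results for one value."""
    return (
        ORIGINAL.search(value) is not None,
        RELAXED.search(value) is not None,
        SUBSTRING in value,
    )


samples = [
    "doGTranslate('en|fr'); return false;",
    "doGTranslate('en|zh-CN'); return false;",
    "doGTranslate(this.value);",  # hypothetical call with a non-literal argument
    "window.print();",            # hypothetical unrelated handler
]

for value in samples:
    print(verdicts(value), value)
```

Any value matched by the original regexp necessarily contains the literal `doGTranslate(`, so the substring check cannot miss anything the current filter catches. It additionally matches calls whose argument is not a quoted pair of codes, as in the third sample.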

That seems fine also.

Done