Tag filters moved from warc2text
nvanva opened this issue · 4 comments
For the previous HPLT data release we used these tag filters: https://github.com/hplt-project/warc2text-runner/blob/main/mt-filter-list.annotated copied from https://github.com/paracrawl/cirrus-scripts. While reimplementing them in Python, I noticed some potential issues to discuss.
If I understood the code correctly, a document was filtered out if any of the specified (tag, attribute) pairs had a value containing a substring that matches the specified regexp (https://github.com/bitextor/warc2text/blob/d066592685c17f5efa2624029e6206f5a74db63f/src/html.cc#L20 employs std::regex_search).
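To make the semantics described above concrete, here is a minimal Python sketch of how I read them. The `FILTERS` table, the `is_filtered` name and the `(tag, attrs)` element representation are all hypothetical and are not taken from warc2text or from mt-filter-list.annotated. `re.search` is used as the rough Python analogue of `std::regex_search` (a match anywhere in the value), though the two regex dialects are not identical.

```python
import re

# Hypothetical filter table: (tag, attribute) -> list of regex patterns.
# These entries are illustrative only, not copied from the real filter list.
FILTERS = {
    ("a", "onclick"): [r"doGTranslate\("],
    ("meta", "name"): [r"translation-stats"],
}


def is_filtered(elements, filters=FILTERS):
    """Return True if any element has a filtered (tag, attribute) pair
    whose value contains a substring matching one of the patterns."""
    for tag, attrs in elements:
        for attr, value in attrs.items():
            for pattern in filters.get((tag, attr), []):
                # re.search matches anywhere in the value, not only at the start.
                if re.search(pattern, value):
                    return True
    return False


# A document is modelled here as a list of (tag, attribute dict) tuples.
doc = [("a", {"href": "#", "onclick": "doGTranslate('en|de'); return false;"})]
print(is_filtered(doc))
```

Under this reading a single matching attribute value is enough to drop the whole document.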
1. Attribute names are lowercased before being compared with the filters, but tag names are not. Attribute values also appear to be matched case-sensitively (https://github.com/bitextor/warc2text/blob/d066592685c17f5efa2624029e6206f5a74db63f/src/util.cc#L125). Would it be reasonable to make the comparison of tags, attribute names and values all case-insensitive?
2. This filter (warc2text-runner/mt-filter-list.annotated, line 32 in 6eb033a) will not match calls of doGTranslate with language codes that are not exactly two letters long. One example found on the Web: `<a href="#" onclick="doGTranslate('en|zh-CN'); return false;"` (https://gtranslate.io/forum/callback-dogtranslate-function-t3989.html). Should we reconsider this filter?
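For the case-sensitivity question, a fully case-insensitive comparison could look roughly like the sketch below. The function name `matches_filter` and its parameters are hypothetical; this only illustrates the proposed behaviour and is not the warc2text code.

```python
import re


def matches_filter(tag, attr, value, f_tag, f_attr, f_pattern):
    """Compare the tag and attribute names case-insensitively and search
    the attribute value with a case-insensitive regex."""
    return (
        tag.lower() == f_tag.lower()
        and attr.lower() == f_attr.lower()
        and re.search(f_pattern, value, flags=re.IGNORECASE) is not None
    )


# An upper-case tag and a mixed-case attribute still match the filter.
print(matches_filter("A", "OnClick", "DOGTRANSLATE('en|de')",
                     "a", "onclick", r"doGTranslate\("))
```

HTML tag and attribute names are case-insensitive anyway, so lowercasing them should be safe; ignoring case in values is the part that could, in principle, widen what a filter matches.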
I think it is a good idea to relax the matching criteria to match a broader set of cases in both scenarios. Maybe for 2. something like `doGTranslate\(\'.{2,6}\|.{2,6}\'\)`?
What would you say about just `doGTranslate(`? It is much simpler and can be checked even without regular expressions. It matches everything the current version matches, so it should not introduce false negatives. Can it result in any false positives?
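A quick way to compare the two proposals is to run both on a few attribute values. In the sketch below the first two samples follow the form of the example quoted in the issue; the other two (double quotes, a function definition) are made-up inputs chosen only to show where the two checks differ. The `compare` helper is hypothetical.

```python
import re

# Relaxed regex proposed above (the backslashes before the quotes are dropped,
# since they are not needed in Python) and the plain substring alternative.
RELAXED = re.compile(r"doGTranslate\('.{2,6}\|.{2,6}'\)")
SUBSTRING = "doGTranslate("


def compare(value):
    """Return (regex_matches, substring_matches) for one attribute value."""
    return bool(RELAXED.search(value)), SUBSTRING in value


samples = [
    "doGTranslate('en|de'); return false;",
    "doGTranslate('en|zh-CN'); return false;",
    'doGTranslate("en|de"); return false;',      # made-up: double quotes
    "function doGTranslate(lang_pair) { ... }",  # made-up: definition, not a call
]
for s in samples:
    print(compare(s), s)
```

The substring check accepts everything the regex accepts, plus values such as the last two samples; whether those count as false positives depends on whether any mention of doGTranslate is considered enough evidence of machine translation.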
That seems fine also.
Done