Double rules on filters are too broad

Question

Double rules on filters are too broad

wmyrda opened this issue 7 years ago · 9 comments

Converter creates double rules on some filter sets which lead to unintended blocks.

For example czech ruleset for instance from record /emonitor. creates

# /emonitor. (filters.txt: 35)
/(?:(*PRUNE).*?/)?emonitor\.
emonitor.*.

and polish ruleset

# /inst. (adblock_social_list.txt: 84)
/(?:(*PRUNE).*?/)?inst\.
inst.*.
# /ins. (adblock_social_list.txt: 83)
/(?:(*PRUNE).*?/)?ins\.
ins.*.
# /in. (adblock_social_list.txt: 82)
/(?:(*PRUNE).*?/)?in\.
in.*.

While the first rule for each is OK, the later is interpreted by privoxy as the rule for whole word instead of subdomain of *. In last case it leads to blocking of interia.pl which should never happen.

Answer 1 · 2018-06-04T19:27:06.000Z

As workaround I used sed -i -e '/\.\*\./s/^/#/' ab2p.action which hashed out all the paterns containing .*. which are as bug reporter in original code has put it to greedy

Hashing them out seems best way to go for now as most of those have a proper line above.

EDIT: Line above hashes out even some properly generated filters. Below hashes those lines, but with exception of the rules containing word PRUNE
sed -i -e '/\.\*\./{/PRUNE/b;s/^/#/}' ab2p.action

Answer 2 · 2018-06-14T11:29:28.000Z

Thanks for this. I (believe that I) added these regex’s for privoxy rules so I need to give it some thought.

Would you please break it down very simple which (example) rule causes the issue, and where the resulting issue is visible. I’m not entirely tracking the basic issue as you’ve described it.

Answer 3 · 2018-06-14T12:09:23.000Z

using polish adblock list creates in action file rule in.*. which blocks interia.pl which I believe is unintended and should be blocked only in case the website would be in.interia.pl or in.interia.com.

Answer 4 · 2018-06-30T12:54:53.000Z

Here I got solution in python which better deals with extra rules needlessly set leaving out only the first as the one best matching original record.

Answer 5 · 2018-09-19T07:51:56.000Z

I see this too. I’m going to try to pick a representative example from easylist.txt to track this down. Here’s one in ab2p.action:

# @@||www.google.*/aclk?*&adurl=$subdocument,~third-party,domain=google.ca|googl
e.co.in|google.co.nz|google.co.uk|google.com|google.com.au|google.com.eg|google.
de|google.es (easylist.txt: 69024)
.www.google.*./(?:(*PRUNE).*?/)?aclk\?(*PRUNE).*?&adurl=

I added the regex PRUNE rules for wildcards to PatternConverter.hs in
a68371e#diff-65132c467ce6dbf2a88b898ab223a563, so the first thing I must do is make sure I didn’t mess anything up when I did that.

Added: This transformation of the original rule does not appear to be correct. Transforming by hand, the privoxy regex in ab2p.action should be (per privoxy.org/user-manual/actions-file.html):

www.google.*/aclk\?(*PRUNE).*?&adurl=

Note:

The pattern matching syntax is different for the host and path parts of the URL. The host part uses a simple globbing type matching technique, while the path part uses more flexible "Regular Expressions" (POSIX 1003.2).

Do you agree that that .*. pattern should not be there in this example?

Added: This transformation of the original rule does not appear to be correct. Transforming by hand, the privoxy regex in ab2p.action should be:

~~.www.google.(?:(*PRUNE).*?/)?aclk\?(*PRUNE).*?&adurl=~~

~~(And I’m not sure why the initial . is there—I’ll have to go review privoxy’s syntax.)~~

Answer 6 · 2018-09-19T08:05:41.000Z

@wmyrda Does the diff you posted in #19 (comment) fix these issues?

I haven’t wrapped my head around this code block yet, but a few issues come to mind:
~~1. It looks like the String ([Char]) initial fragment '.' : '*' is intended to be a wildcard, in which case this should be replaced by the regex (*PRUNE).*?.~~ (Privoxy host patterns use globbing.)
2. Why did you find it necessary to add the changeMiddle in the pattern match?
3. Which part of the code creates the double rules?

Answer 7 · 2018-09-19T12:57:03.000Z

Here’s another example of incorrect transformation:

# /antiadblockmsg. (antiadblockfilters.txt: 85)
/(?:(*PRUNE).*?/)?antiadblockmsg\.
antiadblockmsg.*.

I don’t see how the function makePattern could mangle this if it’s being sent correctly parsed url parts. That would narrow down the issue to the function parseUrl. Also, there’s only a single instance of the string antiadblockmsg so why are there two privoxy rules for /antiadblockmsg.. Do you agree @wmyrda or have I missed something?

Answer 8 · 2018-09-19T19:19:18.000Z

The issue is definitely in parseUrl. I’ve been playing around with Debug.Trace statements with a single-rule test.txt file and see that this function currently returns the observed list of (incorrect) [String]s that appear in ab2p.action when the rule is a query with an initial /.

Focus attention on fixing urlParse.

These rules illustrate this function working correctly and incorrectly:

/foo.
/bar1.html
bar2.html
?q=bar3.html
*/ads/*
/banner/*/img^
||ads.example.com^

Incorrect:

/foo. ➡️

/(?:(*PRUNE).*?/)?foo\.
foo.*.

/bar1.html ➡️

/(?:(*PRUNE).*?/)?bar1\.html
bar1.html*.

bar2.html ➡️

/(*PRUNE).*?bar2\.html
.*bar2.html*.

Correct:

?q=bar3.html ➡️
/(*PRUNE).*?\?q=bar3\.html

*/ads/* ➡️
/(?:(*PRUNE).*?/)?ads/(*PRUNE).*?

/banner/*/img^ ➡️
/(?:(*PRUNE).*?/)?banner/(*PRUNE).*?/img[^\w%.-]

||ads.example.com^ ➡️
.ads.example.com

Fixing this will require deconstructing the functional syntax used throughout urlParse.

Answer 9 · 2018-09-20T21:53:37.000Z

I fixed these issues and others in the latest commit, e6bc337#diff-65132c467ce6dbf2a88b898ab223a563. I tested many different patterns and believe that rule parsing is working correctly now.

The function urlParse is amazing amalgamation of Monads and Text.Parsec functions.

A few comments if this must ever be revisited:

Debugging these parsing Monads is pretty tough. See parsec-trace and Debug.Trace, although I wasn’t ever able to get parser tracing going well.
Personally, both for readability and debugging, I would’ve used regex’s to parse the patterns.
adblock2privoxy takes a lot of CPU cycles and memory to process the major Easylist files mentioned on the README.md page. I suspect this may have to do with the efficiency of the very general parsing approach used in PatternConverter.hs.