Double rules on filters are too broad
wmyrda opened this issue · 9 comments
Converter creates double rules on some filter sets which lead to unintended blocks.
For example czech ruleset for instance from record /emonitor.
creates
# /emonitor. (filters.txt: 35)
/(?:(*PRUNE).*?/)?emonitor\.
emonitor.*.
and polish ruleset
# /inst. (adblock_social_list.txt: 84)
/(?:(*PRUNE).*?/)?inst\.
inst.*.
# /ins. (adblock_social_list.txt: 83)
/(?:(*PRUNE).*?/)?ins\.
ins.*.
# /in. (adblock_social_list.txt: 82)
/(?:(*PRUNE).*?/)?in\.
in.*.
While the first rule for each is OK, the later is interpreted by privoxy as the rule for whole word instead of subdomain of *
. In last case it leads to blocking of interia.pl
which should never happen.
As workaround I used sed -i -e '/\.\*\./s/^/#/' ab2p.action
which hashed out all the paterns containing .*.
which are as bug reporter in original code has put it to greedy
Hashing them out seems best way to go for now as most of those have a proper line above.
EDIT: Line above hashes out even some properly generated filters. Below hashes those lines, but with exception of the rules containing word PRUNE
sed -i -e '/\.\*\./{/PRUNE/b;s/^/#/}' ab2p.action
Thanks for this. I (believe that I) added these regex’s for privoxy rules so I need to give it some thought.
Would you please break it down very simple which (example) rule causes the issue, and where the resulting issue is visible. I’m not entirely tracking the basic issue as you’ve described it.
using polish adblock list creates in action file rule in.*.
which blocks interia.pl
which I believe is unintended and should be blocked only in case the website would be in.interia.pl
or in.interia.com
.
Here I got solution in python which better deals with extra rules needlessly set leaving out only the first as the one best matching original record.
I see this too. I’m going to try to pick a representative example from easylist.txt
to track this down. Here’s one in ab2p.action
:
# @@||www.google.*/aclk?*&adurl=$subdocument,~third-party,domain=google.ca|googl
e.co.in|google.co.nz|google.co.uk|google.com|google.com.au|google.com.eg|google.
de|google.es (easylist.txt: 69024)
.www.google.*./(?:(*PRUNE).*?/)?aclk\?(*PRUNE).*?&adurl=
I added the regex PRUNE
rules for wildcards to PatternConverter.hs in
a68371e#diff-65132c467ce6dbf2a88b898ab223a563, so the first thing I must do is make sure I didn’t mess anything up when I did that.
Added: This transformation of the original rule does not appear to be correct. Transforming by hand, the privoxy regex in ab2p.action
should be (per privoxy.org/user-manual/actions-file.html):
www.google.*/aclk\?(*PRUNE).*?&adurl=
Note:
The pattern matching syntax is different for the host and path parts of the URL. The host part uses a simple globbing type matching technique, while the path part uses more flexible "Regular Expressions" (POSIX 1003.2).
Do you agree that that .*.
pattern should not be there in this example?
Added: This transformation of the original rule does not appear to be correct. Transforming by hand, the privoxy regex in ab2p.action
should be:
.www.google.(?:(*PRUNE).*?/)?aclk\?(*PRUNE).*?&adurl=
(And I’m not sure why the initial .
is there—I’ll have to go review privoxy’s syntax.)
@wmyrda Does the diff you posted in #19 (comment) fix these issues?
I haven’t wrapped my head around this code block yet, but a few issues come to mind:
1. It looks like the (Privoxy host patterns use globbing.)String
([Char]
) initial fragment '.' : '*'
is intended to be a wildcard, in which case this should be replaced by the regex (*PRUNE).*?
.
2. Why did you find it necessary to add the changeMiddle
in the pattern match?
3. Which part of the code creates the double rules?
Here’s another example of incorrect transformation:
# /antiadblockmsg. (antiadblockfilters.txt: 85)
/(?:(*PRUNE).*?/)?antiadblockmsg\.
antiadblockmsg.*.
I don’t see how the function makePattern
could mangle this if it’s being sent correctly parsed url parts. That would narrow down the issue to the function parseUrl
. Also, there’s only a single instance of the string antiadblockmsg
so why are there two privoxy rules for /antiadblockmsg.
. Do you agree @wmyrda or have I missed something?
The issue is definitely in parseUrl
. I’ve been playing around with Debug.Trace
statements with a single-rule test.txt
file and see that this function currently returns the observed list of (incorrect) [String]
s that appear in ab2p.action
when the rule is a query with an initial /
.
Focus attention on fixing urlParse
.
These rules illustrate this function working correctly and incorrectly:
/foo.
/bar1.html
bar2.html
?q=bar3.html
*/ads/*
/banner/*/img^
||ads.example.com^
Incorrect:
/foo.
/(?:(*PRUNE).*?/)?foo\.
foo.*.
/bar1.html
/(?:(*PRUNE).*?/)?bar1\.html
bar1.html*.
bar2.html
/(*PRUNE).*?bar2\.html
.*bar2.html*.
Correct:
?q=bar3.html
/(*PRUNE).*?\?q=bar3\.html
*/ads/*
/(?:(*PRUNE).*?/)?ads/(*PRUNE).*?
/banner/*/img^
/(?:(*PRUNE).*?/)?banner/(*PRUNE).*?/img[^\w%.-]
||ads.example.com^
.ads.example.com
Fixing this will require deconstructing the functional syntax used throughout urlParse
.
I fixed these issues and others in the latest commit, e6bc337#diff-65132c467ce6dbf2a88b898ab223a563. I tested many different patterns and believe that rule parsing is working correctly now.
The function urlParse
is amazing amalgamation of Monads and Text.Parsec
functions.
A few comments if this must ever be revisited:
- Debugging these parsing Monads is pretty tough. See
parsec-trace
andDebug.Trace
, although I wasn’t ever able to get parser tracing going well. - Personally, both for readability and debugging, I would’ve used regex’s to parse the patterns.
adblock2privoxy
takes a lot of CPU cycles and memory to process the major Easylist files mentioned on theREADME.md
page. I suspect this may have to do with the efficiency of the very general parsing approach used in PatternConverter.hs.