Performance issue with backtracking caused by some regexes
elukey opened this issue · 4 comments
Hi! We are happy users of uap-core via https://github.com/ua-parser/uap-python, but we have recently noticed that some regexes may lead to excessive backtracking.
The regex that caused some issue for us is the following:
^(.*)/(\d+).?(\d+)?.?(\d+)?.?(\d+)? CFNetwork
A simple python script like the following hangs on my laptop for several minutes:
user_agent = "Mozilla/5.0 (X11; Linux x86_64_128) AppleWebKit/11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111.111111111111111111111111111111111111111111111111111111111111111111111111 (KHTML, like Gecko) Linux/222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222"
import re
pattern = re.compile(r'^(.*)/(\d+)\.?(\d+)?.?(\d+)?.?(\d+)? CFNetwork')
pattern.search(user_agent)
The main problem seems to be backtracking caused by the various blocks with .?, that doesn't happen if the regex is split into multiple ones for example:
^(.*)/(\d+) CFNetwork
^(.*)/(\d+)\.(\d+)? CFNetwork
^(.*)/(\d+)\.(\d+)\.(\d+)? CFNetwork
^(.*)/(\d+)\.(\d+)\.(\d+)\.?(\d+)? CFNetwork
The example that I brought up is of course a corner case, but even shorter strings with the same format can cause slow downs in parsing or hanging. It would be great to find a solution that is probably more verbose but that doesn't lead to these issues.
Thanks in advance!
Luca
For me it's getting stuck at Bots containing spider|scrape|bot(but not CUBOT)|Crawl
:
((?:[A-z0-9]+|[A-z\-]+ ?)?(?: the )?(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[A-Za-z0-9-]*(?:[^C][^Uu])[Bb]ot|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]*)(?:(?:[ /]| v)(\d+)(?:\.(\d+)(?:\.(\d+))?)?)?
My timing results when doubling the input user agent string length of a string of repeated letters:
> %time a = user_agent_parser.ParseUserAgent('a' * 100)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 47.2 ms
> %time a = user_agent_parser.ParseUserAgent('a' * 200)
CPU times: user 124 ms, sys: 4 ms, total: 128 ms
Wall time: 131 ms
> %time a = user_agent_parser.ParseUserAgent('a' * 400)
CPU times: user 612 ms, sys: 12 ms, total: 624 ms
Wall time: 626 ms
> %time a = user_agent_parser.ParseUserAgent('a' * 800)
CPU times: user 4.83 s, sys: 8 ms, total: 4.84 s
Wall time: 4.83 s
> %time a = user_agent_parser.ParseUserAgent('a' * 1600)
CPU times: user 38 s, sys: 104 ms, total: 38.1 s
Wall time: 38 s
> %time a = user_agent_parser.ParseUserAgent('a' * 3200)
CPU times: user 5min 3s, sys: 560 ms, total: 5min 4s
Wall time: 5min 3s
Problem should be solved. Could you please do a retest with a version greater 0.6.0?
If things got solved please close issue.
How does replacing x?
by (?:x|)
prevent catastrophic backtracking?
It doesn't. In both python and javascript the same REDoS performance problems still exist (confirm by trying the above examples).
I assume it's because of the way safe-regex
uses a heuristic test to find problematic regexes. I'm sorry but to me it looks like what #363 does is simply find a nice way to bypass safe-regex
's test without actually fixing the regular expressions!
Hi @bcaller, Could you please retest with latest version 0.6.3? Thanks.