ua-parser/uap-core

Performance issue with backtracking caused by some regexes

elukey opened this issue · 4 comments

Hi! We are happy users of uap-core via https://github.com/ua-parser/uap-python, but we have recently noticed that some regexes may lead to excessive backtracking.

The regex that caused some issue for us is the following:

^(.*)/(\d+).?(\d+)?.?(\d+)?.?(\d+)? CFNetwork

A simple python script like the following hangs on my laptop for several minutes:

user_agent = "Mozilla/5.0 (X11; Linux x86_64_128) AppleWebKit/11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111.111111111111111111111111111111111111111111111111111111111111111111111111 (KHTML, like Gecko) Linux/222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222"
import re
pattern = re.compile(r'^(.*)/(\d+)\.?(\d+)?.?(\d+)?.?(\d+)? CFNetwork')
pattern.search(user_agent)

The main problem seems to be backtracking caused by the various blocks with .?, that doesn't happen if the regex is split into multiple ones for example:

^(.*)/(\d+) CFNetwork
^(.*)/(\d+)\.(\d+)? CFNetwork
^(.*)/(\d+)\.(\d+)\.(\d+)? CFNetwork
^(.*)/(\d+)\.(\d+)\.(\d+)\.?(\d+)? CFNetwork

The example that I brought up is of course a corner case, but even shorter strings with the same format can cause slow downs in parsing or hanging. It would be great to find a solution that is probably more verbose but that doesn't lead to these issues.

Thanks in advance!

Luca

For me it's getting stuck at Bots containing spider|scrape|bot(but not CUBOT)|Crawl:

((?:[A-z0-9]+|[A-z\-]+ ?)?(?: the )?(?:[Ss][Pp][Ii][Dd][Ee][Rr]|[Ss]crape|[A-Za-z0-9-]*(?:[^C][^Uu])[Bb]ot|[Cc][Rr][Aa][Ww][Ll])[A-z0-9]*)(?:(?:[ /]| v)(\d+)(?:\.(\d+)(?:\.(\d+))?)?)?

My timing results when doubling the input user agent string length of a string of repeated letters:

> %time a = user_agent_parser.ParseUserAgent('a' * 100)
CPU times: user 48 ms, sys: 0 ns, total: 48 ms
Wall time: 47.2 ms

> %time a = user_agent_parser.ParseUserAgent('a' * 200)
CPU times: user 124 ms, sys: 4 ms, total: 128 ms
Wall time: 131 ms

> %time a = user_agent_parser.ParseUserAgent('a' * 400)
CPU times: user 612 ms, sys: 12 ms, total: 624 ms
Wall time: 626 ms

> %time a = user_agent_parser.ParseUserAgent('a' * 800)
CPU times: user 4.83 s, sys: 8 ms, total: 4.84 s
Wall time: 4.83 s

> %time a = user_agent_parser.ParseUserAgent('a' * 1600)
CPU times: user 38 s, sys: 104 ms, total: 38.1 s
Wall time: 38 s

> %time a = user_agent_parser.ParseUserAgent('a' * 3200)
CPU times: user 5min 3s, sys: 560 ms, total: 5min 4s
Wall time: 5min 3s

Problem should be solved. Could you please do a retest with a version greater 0.6.0?
If things got solved please close issue.

How does replacing x? by (?:x|) prevent catastrophic backtracking?

It doesn't. In both python and javascript the same REDoS performance problems still exist (confirm by trying the above examples).

I assume it's because of the way safe-regex uses a heuristic test to find problematic regexes. I'm sorry but to me it looks like what #363 does is simply find a nice way to bypass safe-regex's test without actually fixing the regular expressions!

Hi @bcaller, Could you please retest with latest version 0.6.3? Thanks.