oalders/http-browserdetect

Pathologically slow UA string

mfontani opened this issue · 5 comments

At $work, we instantiate HTTP::BrowserDetect for every request to allow templates to react to properties of the browser and, say, output some CSS or JS only for specific browser families.

In recent days, I've noticed a few web controllers taking an unreasonably long time to respond, or even timing out. I've tracked that down to a likely crawler sending a very long (8KiB+) user-agent string made up of the same base user-agent string (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36) repeated 70+ times. Merely requesting a page with a user-agent string containing the above string repeated 70 times is enough to make any page on the site that calls HTTP::BrowserDetect->new($ua) take 15+ seconds to be delivered, and virtually all of that time is spent in the HTTP::BrowserDetect->new($ua) call itself. A further (but brief!) NYTProf run showed that the time is mostly spent in "...::CORE::match", at which point I stopped, suspecting a pathological regex; it wasn't easy to find which one it was via NYTProf alone.

I'm in the process of warding against this problem in other ways, but wondered whether this is something that could be looked into; pointers on how to track down the offending regex and fix it myself in a PR would also be welcome.
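For reference, the "other ways" are along these lines: a sketch of a call-site guard (the clamp_ua helper and the 512-byte cap are invented here for illustration, not part of HTTP::BrowserDetect):

```perl
use strict;
use warnings;

use constant MAX_UA_LEN => 512;    # assumption: genuine browser UAs are far shorter

# Clamp the UA before handing it to HTTP::BrowserDetect, so a crawler
# sending an 8KiB+ string can't feed the pathological regexes much input.
sub clamp_ua {
    my ($ua) = @_;
    return '' unless defined $ua;
    return length($ua) > MAX_UA_LEN ? substr( $ua, 0, MAX_UA_LEN ) : $ua;
}

my $huge = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' x 200;    # ~8KiB of repeats
my $safe = clamp_ua($huge);
print length($safe), "\n";    # 512
# my $bd = HTTP::BrowserDetect->new($safe);    # then parse as usual
```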

Example program:

    use strict;
    use warnings;
    use HTTP::BrowserDetect;

    my $base_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36';
    my $ua      = join ' ', ($base_ua) x 70;    # list repetition: 70 space-separated copies
    my $t0      = time;
    my $bd      = HTTP::BrowserDetect->new($ua);
    print 'Done in ', time - $t0, "s.\n";       # pathological - 17s

    # Same string minus the trailing 'Safari/537.36' token:
    $base_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106';
    $ua      = join ' ', ($base_ua) x 70;
    $t0      = time;
    $bd      = HTTP::BrowserDetect->new($ua);
    print 'Done in ', time - $t0, "s.\n";       # non-pathological, 1s

Thanks very much for this! I did some quick looking around, and basically there's no upper limit on how long a UA string can be; it's more a function of how large a header can be sent. As far as a UA spec goes, that's pretty much a dead end as well: it looks like you can have a legitimate UA string that is quite long.

I actually haven't checked the code for hot spots in the past, so I can't speculate as to whether it's one regex or many that need to be optimized. I also don't have any suggestions more helpful than returning early from parts of the regex parsing, to see whether hot spots can be found by checking if the code speeds up when certain sections are skipped.

Having said all that, I'm happy to cut a quick release once this is fixed.

This seems to be the offendingly slow regex, which I'm also (sheepishly) the last person to have touched...:

    }
    elsif ( $ua
        =~ m{^mozilla/.+windows (?:nt|phone) \d{2}\.\d+;?.+ applewebkit/.+ chrome/.+ safari/.+ edge/[\d.]+$}
    ) {
        $browser        = 'edge';
        $browser_string = 'Edge';

        $browser_tests->{edge} = 1;
    }
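To confirm it without the module, here's a quick harness (my own, for illustration) that runs that pattern directly against the repeated UA at a few sizes; the repeat counts are kept small so it finishes quickly, but the match time grows rapidly with the count:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# The suspect pattern, copied from HTTP::BrowserDetect; the harness
# around it is my own illustration.
my $edge_re = qr{^mozilla/.+windows (?:nt|phone) \d{2}\.\d+;?.+ applewebkit/.+ chrome/.+ safari/.+ edge/[\d.]+$};

my $base = lc( 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
             . ' AppleWebKit/537.36 (KHTML, like Gecko)'
             . ' Chrome/68.0.3440.106 Safari/537.36' );

for my $n ( 5, 10, 20 ) {
    my $ua  = join ' ', ($base) x $n;   # n space-separated copies, no 'edge/' anywhere
    my $t0  = time;
    my $hit = $ua =~ $edge_re;          # must fail, but only after heavy backtracking
    printf "repeats=%-3d matched=%s elapsed=%.4fs\n",
        $n, ( $hit ? 'yes' : 'no' ), time - $t0;
}
```

Every copy of the base string offers the engine another candidate position for each of the four .+ segments, so the number of split combinations it tries before giving up grows polynomially with the repeat count.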

I'll see what I can do to "rein in" all that backtracking, and will attach a patch / push a PR if successful.
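For example, one cheap way to rein it in at the match site (a sketch of the general technique, not necessarily the patch I'll submit): the pattern can only ever match when the literal 'edge/' is present, so an index() pre-check lets the code skip the expensive regex entirely for UAs like the crawler's:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

my $base = lc( 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
             . ' AppleWebKit/537.36 (KHTML, like Gecko)'
             . ' Chrome/68.0.3440.106 Safari/537.36' );
my $ua = join ' ', ($base) x 70;    # the pathological 8KiB+ string

my $t0      = time;
my $matched = 0;

# Cheap literal pre-check: if 'edge/' never occurs in the string, the
# full pattern cannot possibly match, so all the backtracking is skipped.
if ( index( $ua, 'edge/' ) >= 0 ) {
    $matched = $ua =~ m{^mozilla/.+windows (?:nt|phone) \d{2}\.\d+;?.+ applewebkit/.+ chrome/.+ safari/.+ edge/[\d.]+$};
}

printf "matched=%s elapsed=%.4fs\n", ( $matched ? 'yes' : 'no' ), time - $t0;
```

On the crawler's string this takes microseconds instead of 15+ seconds, while a genuine Edge UA still reaches the regex because it contains 'edge/'.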

The above referenced commit seems to work for me (i.e. all current tests continue to pass), while the time taken on the pathological string goes down from 15s+ to a few hundred milliseconds.

Thanks! Do you want to create a pull request from that?

I'm discussing the patch with a colleague so we can hopefully get an even better regex out of it. I'll create a PR on Monday, or as soon as I'm fully happy with it.