monperrus/crawler-user-agents

Ordering by most common

plbowers opened this issue · 10 comments

Most of the time, people using this code will want to identify bots as quickly as possible. Ordering the patterns by how commonly each bot is seen would speed up the process, since a match could be found early and the check could exit quickly.

I did a very quick optimization using the frequency reported on this page:

https://deviceatlas.com/blog/list-of-web-crawlers-user-agents

And then I put all your patterns (concatenated with |) into 2 preg_match() calls:

if (preg_match('/most|common|patterns/', $_SERVER['HTTP_USER_AGENT']) || preg_match('/less|common|patterns/', $_SERVER['HTTP_USER_AGENT'])) {
    // is a bot
} else {
    // isn't a bot
}

Providing a script to produce that might be a help...?

interesting comment!

One option is to add a field such as "prevalence" or "priority" that reflects this information and that could be used to generate the regexp in the right order.
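A minimal sketch of how such a field could drive regexp generation. The "prevalence" field and its values are hypothetical (the list has no such field today), and the three entries are made up for illustration:

```python
import re

# Hypothetical entries: "pattern" mirrors the list's existing field;
# "prevalence" is an assumed field that does not exist in the project today.
entries = [
    {"pattern": "rarebot", "prevalence": 1},
    {"pattern": "Googlebot", "prevalence": 90},
    {"pattern": "bingbot", "prevalence": 40},
]

def build_ordered_regex(entries):
    """Join patterns into one alternation, most prevalent first."""
    ordered = sorted(entries, key=lambda e: e["prevalence"], reverse=True)
    return re.compile("|".join(f"(?:{e['pattern']})" for e in ordered))

crawler_re = build_ordered_regex(entries)
```

Each pattern is wrapped in a non-capturing group so that a pattern containing its own `|` cannot bleed into its neighbours.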

WDYT?

Sure - that would be a good solution.

@plbowers have you got any hard benchmark figures proving that your method would indeed be significantly faster?

This would be a strange optimisation unless more than 50% of your User-Agent tests match the crawler list. For non-crawler traffic, both regex groups report 'no match', so every pattern is tried regardless of order. Assuming 95%+ of your User-Agent traffic is non-crawler, you would be optimising for something that occurs rarely.

If you are looking to lower latency, you should look at using a language (or maybe PHP has a C extension) that lets you compile the concatenated version of the RE, which then means ordering is irrelevant:

For example:

Some languages cache the compiled version automatically for you (I cannot see if PHP does too):

  • Perl
  • JavaScript - I suspect that is why .compile() is now deprecated
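As a sketch of the compile-once idea in a language not mentioned above, Python lets the concatenated pattern be compiled a single time and reused. The pattern list here is illustrative only, and whether ordering stops mattering still depends on the regex engine in use:

```python
import re

# Illustrative patterns only; a real list would come from crawler-user-agents.json.
PATTERNS = ["Googlebot", "bingbot", "Slurp"]

# Compiled once at import time and reused for every request.
CRAWLER_RE = re.compile("|".join(PATTERNS))

def is_crawler(user_agent: str) -> bool:
    """Return True when any of the concatenated patterns occurs in the user agent."""
    return CRAWLER_RE.search(user_agent) is not None
```

Python's `re` module also keeps an internal cache of recently compiled patterns, but holding the compiled object explicitly avoids relying on it.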

Fale commented

I tried multiple cases using https://godoc.org/go.kelfa.io/kelfa/pkg/crawlerflagger (it's written in Go).

It exposes 2 ways to query the crawler-user-agents list:

  • ExactMatch (it uses the "instances" field)
  • RegexpMatch (it uses the "pattern" field)

I tried to match the 1st entry, the 100th entry, the 200th entry, the 300th entry, the 400th entry, and a non-existent entry. These are the results:

BenchmarkName                         Iterations         Average (nanoseconds/operation)
BenchmarkExactMatch/case0-8             10000000               182 ns/op
BenchmarkExactMatch/case101-8           10000000               128 ns/op
BenchmarkExactMatch/case200-8           10000000               137 ns/op
BenchmarkExactMatch/case300-8           10000000               124 ns/op
BenchmarkExactMatch/case400-8           20000000               113 ns/op
BenchmarkExactMatch/miss-8             200000000                 8.03 ns/op
BenchmarkRegExpMatch/case0-8             5000000               292 ns/op
BenchmarkRegExpMatch/case101-8            200000              7335 ns/op
BenchmarkRegExpMatch/case200-8            100000             12866 ns/op
BenchmarkRegExpMatch/case300-8            100000             21898 ns/op
BenchmarkRegExpMatch/case400-8             50000             26515 ns/op
BenchmarkRegExpMatch/miss-8                50000             23963 ns/op

So this seems to suggest that for "instances"-based matching (at least in Go) the order has no relevance, while it does matter for "pattern"-based matching (at least in Go).
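One plausible explanation, sketched in Python rather than Go: if exact matching is backed by a hash lookup and pattern matching tries one regexp per entry in list order, only the latter depends on position. How crawlerflagger actually implements ExactMatch is not stated here, so the hash lookup is an assumption, and the entries below are made up:

```python
import re

# Illustrative data; names and strings are made up for this sketch.
entries = [
    {"pattern": f"examplebot{i}/", "instances": [f"examplebot{i}/1.0"]}
    for i in range(400)
]

# Exact matching: one hash lookup, independent of list position.
instance_index = {ua: i for i, e in enumerate(entries) for ua in e["instances"]}

def exact_match(user_agent):
    """Return the entry index for a known instance string, or None."""
    return instance_index.get(user_agent)

# Pattern matching: patterns are tried in list order, so a late entry
# (or a miss) pays for every regexp that comes before it.
compiled = [re.compile(e["pattern"]) for e in entries]

def regexp_match(user_agent):
    """Return the index of the first matching pattern, or None."""
    for i, rx in enumerate(compiled):
        if rx.search(user_agent):
            return i
    return None
```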

Interesting, did you also test with a single pattern that concatenates all patterns with |?

Fale commented

At the moment there are 400+ regexps (one per entry) and then a switch to analyse the case that matches.

The reason I implemented it this way is that I'm not sure how to identify which case matched. With a single combined pattern it would be possible to tell whether at least one pattern matches the input string, but not which one.
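One way around this, sketched in Python under the assumption that the regex engine supports named groups: wrap each pattern in its own named group and read back which group matched. The group names `p0`, `p1`, ... and the three patterns are illustrative. Go's `regexp` package also has named groups (see `SubexpNames`), though this sketch has not been tried against crawlerflagger:

```python
import re

# Illustrative patterns; a real list would come from the "pattern" field.
patterns = ["Googlebot", "bingbot", "Slurp"]

# Wrap each pattern in a named group (p0, p1, ...) so the match reports its origin.
combined = re.compile(
    "|".join(f"(?P<p{i}>{p})" for i, p in enumerate(patterns))
)

def which_pattern(user_agent):
    """Return the index of the matching pattern, or None when nothing matches."""
    m = combined.search(user_agent)
    if m is None:
        return None
    # lastgroup is the name of the outermost group that matched, e.g. "p1".
    return int(m.lastgroup[1:])
```

The returned index can then be used to look up the full entry, which replaces the per-entry switch with a single search.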

@monperrus Since most bot user agents contain bot, crawler, or spider, we could group all of those user agents into a single pattern such as bot|crawl|spider; a regex like this might help. Folding them into one pattern would reduce the number of patterns to be matched.

A generic regex like that is a good idea but you do have to be very careful not to create false positives. You can’t have bot as part of that regex as there are a few genuine user-agents that have bot as part of their name, Cubot for example.
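A small sketch of that false positive. The user agent string is made up in the style of a Cubot handset, and the stricter variant with a negative lookbehind is only one possible workaround, not something proposed in this thread:

```python
import re

# The generic pattern, assumed here to be applied case-insensitively.
generic = re.compile(r"bot|crawl|spider", re.IGNORECASE)

# Made-up user agent in the style of a Cubot handset; not copied from real traffic.
phone_ua = (
    "Mozilla/5.0 (Linux; Android 9; CUBOT_X19) "
    "AppleWebKit/537.36 Mobile Safari/537.36"
)

# Narrower variant (an assumption for this sketch): ignore "bot" when preceded by "cu".
stricter = re.compile(r"(?<!cu)bot|crawl|spider", re.IGNORECASE)

generic_hit = generic.search(phone_ua) is not None    # the phone is flagged as a bot
stricter_hit = stricter.search(phone_ua) is not None  # the phone is no longer flagged
```

Every such exception has to be discovered and maintained by hand, which is the cost of a broad generic pattern.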

The best way to increase the performance of a regex such as this is to remove common strings from the source user-agent.

As you can see here...
https://github.com/JayBizzle/Crawler-Detect/blob/master/src/Fixtures/Exclusions.php
...we run a regex replace on the user agent first that removes any of the common matches before running the bot regexes.
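The same two-step idea, sketched in Python. Both regexes are short stand-ins invented for this example; Crawler-Detect's real exclusion and crawler lists are much longer:

```python
import re

# Illustrative exclusions: common browser tokens that never identify a crawler.
EXCLUSIONS = re.compile(
    r"Mozilla/\d\.\d|AppleWebKit/[\d.]+|Safari/[\d.]+|Chrome/[\d.]+|Windows NT [\d.]+"
)
# Illustrative crawler patterns, matched case-insensitively.
CRAWLERS = re.compile(r"Googlebot|bingbot|crawl|spider", re.IGNORECASE)

def is_crawler(user_agent: str) -> bool:
    """Strip common tokens, then test what is left against the crawler patterns."""
    # Removing the common tokens first leaves a shorter string to scan.
    stripped = EXCLUSIONS.sub("", user_agent)
    return CRAWLERS.search(stripped) is not None
```

For a typical browser user agent, most of the string is removed before the larger crawler regex ever runs.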

We saw a 55% speed increase doing this.

Grouping patterns is something to do on the user side, as in @JayBizzle's example.

Note that we'd be happy to merge example code snippets for grouping in the README.