Ordering by most common
plbowers opened this issue · 10 comments
Most of the time, people using this code will be hoping to identify bots as quickly as possible. Ordering the patterns by the most commonly identified bots would speed up the process, letting the check match early and return quickly.
I did a very quick optimization using the frequency reported on this page:
https://deviceatlas.com/blog/list-of-web-crawlers-user-agents
And then I put all your patterns (concatenated with |) into 2 preg_match() calls:
if (preg_match('/most|common|patterns/', $_SERVER['HTTP_USER_AGENT'])
    || preg_match('/less|common|patterns/', $_SERVER['HTTP_USER_AGENT'])) {
    // is a bot
} else {
    // isn't a bot
}
Providing a script to produce that might be a help...?
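A rough sketch of such a script (the $frequency map is a hypothetical, hand-maintained list based on the frequencies above; the values here are made up, and everything not in the map is treated as "less common"):

```php
<?php
// Sketch: split the crawler-user-agents patterns into "most common" and
// "less common" groups and emit one concatenated regex per group.
$entries   = json_decode(file_get_contents('crawler-user-agents.json'), true);
$frequency = ['Googlebot' => 100, 'bingbot' => 80]; // illustrative values only

$common = [];
$rare   = [];
foreach ($entries as $entry) {
    $escaped = str_replace('/', '\/', $entry['pattern']); // escape the '/' delimiter
    if (isset($frequency[$entry['pattern']])) {
        $common[$entry['pattern']] = $escaped;
    } else {
        $rare[] = $escaped;
    }
}
// Most frequent patterns first, so the first preg_match() can succeed early.
uksort($common, fn($a, $b) => $frequency[$b] <=> $frequency[$a]);

echo '/' . implode('|', $common) . "/\n"; // "most common" regex
echo '/' . implode('|', $rare)   . "/\n"; // "less common" regex
```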
Interesting comment!
One option is to add a field ("prevalence" or "priority") that reflects this information and could be used to generate the regexp in the right order.
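For instance, with a hypothetical "prevalence" field on each entry, the generation could look roughly like this (a sketch only, not existing code in this repository):

```php
<?php
// Sketch: sort entries by a hypothetical "prevalence" field (highest first)
// and generate the combined regexp in that order.
$entries = json_decode(file_get_contents('crawler-user-agents.json'), true);
usort($entries, fn($a, $b) => ($b['prevalence'] ?? 0) <=> ($a['prevalence'] ?? 0));
$patterns = array_map(fn($e) => str_replace('/', '\/', $e['pattern']), $entries);
$regex = '/' . implode('|', $patterns) . '/';
```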
WDYT?
Sure - that would be a good solution.
@plbowers have you got any hard benchmark figures proving that your method would indeed be significantly faster?
This would be a strange optimisation to do unless more than 50% of your User-Agent tests match the crawler list. This is because for non-crawler traffic both regex groups report 'no match', so you would be optimising for something that occurs rarely, assuming your User-Agent traffic is 95%+ non-crawler.
If you are looking to lower latency, then you should look at using a language (or maybe PHP has a C extension) that lets you compile the concatenated version of the RE (which then means ordering is irrelevant).
For example, some languages cache the compiled version automatically for you (I cannot see if PHP does too):
- Perl
- JavaScript - I suspect that is why .compile() is now deprecated
I tried multiple cases using https://godoc.org/go.kelfa.io/kelfa/pkg/crawlerflagger (it's written in Go).
It exposes 2 ways to query the crawler-user-agents list:
- ExactMatch (it uses the "instances" field)
- RegexpMatch (it uses the "pattern" field)
I tried to match the 1st entry, the 100th entry, the 200th entry, the 300th entry, the 400th entry, and a non-existent entry; these are the results:
| Benchmark | Iterations | Average (ns/op) |
| --- | --- | --- |
| BenchmarkExactMatch/case0-8 | 10000000 | 182 |
| BenchmarkExactMatch/case101-8 | 10000000 | 128 |
| BenchmarkExactMatch/case200-8 | 10000000 | 137 |
| BenchmarkExactMatch/case300-8 | 10000000 | 124 |
| BenchmarkExactMatch/case400-8 | 20000000 | 113 |
| BenchmarkExactMatch/miss-8 | 200000000 | 8.03 |
| BenchmarkRegExpMatch/case0-8 | 5000000 | 292 |
| BenchmarkRegExpMatch/case101-8 | 200000 | 7335 |
| BenchmarkRegExpMatch/case200-8 | 100000 | 12866 |
| BenchmarkRegExpMatch/case300-8 | 100000 | 21898 |
| BenchmarkRegExpMatch/case400-8 | 50000 | 26515 |
| BenchmarkRegExpMatch/miss-8 | 50000 | 23963 |
So it seems to suggest that, at least in Go, the order has absolutely no relevance for the "instances"-based match, while it does matter for the "pattern"-based match.
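To illustrate why the exact match is order-insensitive (a PHP sketch of the idea only, not how crawlerflagger is implemented): if every known instance string is a hash-map key, the lookup cost is the same wherever the entry sits in the list.

```php
<?php
// Index every known instance string in a hash map.
$instances = [];
foreach (json_decode(file_get_contents('crawler-user-agents.json'), true) as $entry) {
    foreach ($entry['instances'] ?? [] as $ua) {
        $instances[$ua] = $entry['pattern']; // instance string => owning pattern
    }
}
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$isKnownCrawler = isset($instances[$ua]);    // O(1), independent of list order
```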
Interesting, did you also test with a single pattern concatenating all patterns with |?
At the moment there are 400+ regexps (one per entry) and then a switch to analyse the case that matches.
The reason I implemented it this way is that I'm not really sure how to identify which case matched. Basically, with a single combined pattern it would be possible to decide whether at least one pattern matches the input string, but not which one.
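One common trick (a sketch, and it only works if the individual patterns contain no capturing groups of their own) is to wrap each pattern in its own capturing group and check which group is non-empty after a match:

```php
<?php
// Sketch: combine patterns as (p1)|(p2)|... and find the first non-empty group
// to recover which entry matched.
$patterns = ['Googlebot', 'bingbot', 'Yahoo! Slurp']; // illustrative subset
$combined = '/(' . implode(')|(', array_map(
    fn($p) => str_replace('/', '\/', $p),
    $patterns
)) . ')/';

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (preg_match($combined, $userAgent, $m)) {
    for ($i = 1; $i < count($m); $i++) {
        if ($m[$i] !== '') {                 // first non-empty group = matching pattern
            echo 'matched: ' . $patterns[$i - 1] . "\n";
            break;
        }
    }
}
```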
@monperrus Since most of the bot user agents contain bot, crawler, or spider, we could group all those user agents under a single generic pattern like bot|crawl|spider. Reducing them to a single pattern would cut down the number of patterns to be matched.
A generic regex like that is a good idea but you do have to be very careful not to create false positives. You can’t have bot as part of that regex as there are a few genuine user-agents that have bot as part of their name, Cubot for example.
The best way to increase the performance of a regex such as this is to remove common strings from the source user-agent.
As you can see here...
https://github.com/JayBizzle/Crawler-Detect/blob/master/src/Fixtures/Exclusions.php
...we run a regex replace on the user agent first that removes any of the common matches before running the bot regexes.
We saw a 55% speed increase doing this.
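A rough sketch of that two-step idea (the exclusion list and the generic pattern below are illustrative only, not the actual Crawler-Detect lists):

```php
<?php
// Sketch: strip common benign tokens first, then run a short generic crawler
// pattern on whatever is left of the user agent.
$exclusions = '/Mozilla\/5\.0|Safari\/[\d.]+|Chrome\/[\d.]+|Cubot/i';
$generic    = '/bot|crawl|spider/i';

$ua       = $_SERVER['HTTP_USER_AGENT'] ?? '';
$stripped = preg_replace($exclusions, '', $ua);
$isBot    = (bool) preg_match($generic, $stripped);
```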
Grouping patterns is done on the user side, as in @JayBizzle's example.
Note that we'd be happy to merge example code snippets for grouping in the README.