JayBizzle/Crawler-Detect

Regex concatenation performance concern

Bilge opened this issue · 3 comments

Bilge commented

public function compileRegex($patterns)
{
return '('.implode('|', $patterns).')';
}

This mass concatenation of pattern pieces into a final regex should be precomputed by the library instead of making consuming applications perform this work on every instantiation of the CrawlerDetect object. That is, the fixture arrays should be transformed into the final regex string and output to a precomputed PHP file to save consumers from this unnecessary overhead. I suggest a PHP file because opcache can automatically cache the file for additional performance gains.

Example

patterns.php

<?php
return '/(.*Java.*outbrain| YLT|^b0t$|^bluefish ...)/i';

Again, I would like to see some benchmark figures to see what kind of improvement could be gained here.

Concatenating the regex into one precompiled regex string would also not allow us to easily test regex collisions...

public function there_are_no_regex_collisions()
{
$crawlers = new Crawlers();
foreach ($crawlers->getAll() as $key1 => $regex) {
foreach ($crawlers->getAll() as $key2 => $compare) {
// Dont check this regex against itself
if ($key1 != $key2) {
preg_match('/'.$regex.'/i', stripslashes($compare), $matches);
$this->assertEmpty($matches, $regex.' collided with '.$compare);
}
}
}
}

Bilge commented

Concatenating the regex into one precompiled regex string would also not allow us to easily test regex collisions...

That's simply not true because the precomputed list cannot be maintained if the original list is deleted. Ergo the original array still exists and can still be used in the test. They're not mutually exclusive.

An example would help here, it's not clear what you mean