shadowhand/latitude

Faster LIKE-escaping (code inside)

aitte2 opened this issue · 2 comments

I noticed this "heavy" code that makes three separate str_replace calls, each of which has to scan the whole string, build a new one and return it. That is going to suck if the string is a very long one:

/**
 * Escape input for a LIKE condition value.
 */
public static function escape(string $value): string
{
    // Backslash is used to escape wildcards.
    $value = str_replace('\\', '\\\\', $value);
    // Standard wildcards are underscore and percent sign.
    $value = str_replace('%', '\\%', $value);
    $value = str_replace('_', '\\_', $value);
    return $value;
}

And you do them in a special order to ensure backslashes are processed first, etc...
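To make the ordering point concrete, here is a small sketch of what happens when the backslash pass runs last instead of first. Both function names are made up for this illustration and are not part of the library:

```php
<?php
// Hypothetical helper: same three replacements as escape(),
// but with the backslash handled last instead of first.
function escapeWrongOrder(string $value): string
{
    $value = str_replace('%', '\\%', $value);
    $value = str_replace('_', '\\_', $value);
    // This pass also doubles the backslashes added by the two calls above.
    return str_replace('\\', '\\\\', $value);
}

// Hypothetical helper: same order as the library's escape().
function escapeRightOrder(string $value): string
{
    $value = str_replace('\\', '\\\\', $value);
    $value = str_replace('%', '\\%', $value);
    return str_replace('_', '\\_', $value);
}

$input = '50%_off';
var_dump(escapeRightOrder($input)); // string(9) "50\%\_off"
var_dump(escapeWrongOrder($input)); // string(11) "50\\%\\_off"
```

In the wrong order every wildcard ends up behind a doubled backslash, so the wildcard itself is no longer escaped.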

I am not 100% sure but I THINK that all of that can be handled like this instead:

$str = 'a\\_b\\c%'; // Input: a\_b\c%
var_dump(strtr($str, ['\\' => '\\\\', '%' => '\\%', '_' => '\\_'])); // Result: a\\\_b\\c\%

As PHP's manual says, strtr will not replace a replacement, so that is why I think this should be safe: http://php.net/strtr
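A quick way to check that claim, as a sketch: with the array form of strtr, the order of the pairs should not matter, because replaced text is not searched again. The variable names below are only for illustration:

```php
<?php
$input = 'a\\_b\\c%'; // a\_b\c%

// Backslash listed first.
$mapA = ['\\' => '\\\\', '%' => '\\%', '_' => '\\_'];
// Same pairs, backslash listed last.
$mapB = ['_' => '\\_', '%' => '\\%', '\\' => '\\\\'];

$a = strtr($input, $mapA);
$b = strtr($input, $mapB);

var_dump($a === $b); // bool(true)
echo $a, "\n";       // a\\\_b\\c\%
```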

And as you can see, it's just a SINGLE function call which handles it all in native C code! No more repeated, wasteful copying back/forth of strings between native str_replace() and PHP. :-)

PS: Any other code locations that do this kind of escaping would benefit from the same idea... :-) Maybe the library has a generic escape() / quote()? I haven't started using it yet, but this idea would be good there too.

I think this is probably a micro-optimization that would have no real performance benefit. If you can provide benchmarks that prove otherwise, I would be happy to look at it more. The current code is aiming for readability over maximum possible performance. After all, the biggest bottleneck will be the actual database fetch and return. 🤓

Wow... don't use this code... Turns out that PHP's strtr is not an efficiently implemented native algorithm after all. I assumed it would just loop over the whole string ONCE and replace each encountered character. But it must be doing a lot more work, because look at this:

<?php

$str = 'L%orem \ipsum d%ol%or \\s\it a_met, c%onsectetur a_d\ip\isc\ing el\it, sed d%o e\iusm%od temp%or \inc\id\idunt ut la_b%ore et d%ol%ore ma_gna_ a_l\iqua_. Ut \\en\im a_d m\in\im ven\ia_m, qu\is n%ostrud exerc\ita_t\i%on ulla_mc%o la_b%or\is n\is\i ut a_l\iqu\ip ex ea_ c%omm%od%o c%onsequa_t. Du\is a_ute \irure d%ol%or \in reprehender\it \in v%olupta_te vel\it esse c\illum d%ol%ore eu fug\ia_t nulla_ pa_r\ia_tur. Excepteur s\int %occa_eca_t cup\ida_ta_t n%on pr%o\ident, sunt \in culpa_ qu\i %off\ic\ia_ deserunt m%oll\it a_n\im \id est la_b%orum.';

function escapeA($value) {
    // Backslash is used to escape wildcards.
    $value = str_replace('\\', '\\\\', $value);
    // Standard wildcards are underscore and percent sign.
    $value = str_replace('%', '\\%', $value);
    $value = str_replace('_', '\\_', $value);

    return $value;
}

function escapeB($value) {
    return strtr($value, ['\\' => '\\\\', '%' => '\\%', '_' => '\\_']);
}

function escapeC($value) {
    static $search = ['\\', '%', '_'];
    static $replace = ['\\\\', '\\%', '\\_'];

    return str_replace($search, $replace, $value);
}

printf("escapeA === escapeB: %s\n", escapeA($str) === escapeB($str) ? 'true' : 'false');
printf("escapeA === escapeC: %s\n", escapeA($str) === escapeC($str) ? 'true' : 'false');

$iterations = 100000;

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
    $x = escapeA($str);
}
$elapsed = 1000 * (microtime(true) - $start);
printf("escapeA: %.3f ms\n", $elapsed);

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
    $x = escapeB($str);
}
$elapsed = 1000 * (microtime(true) - $start);
printf("escapeB: %.3f ms\n", $elapsed);

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
    $x = escapeC($str);
}
$elapsed = 1000 * (microtime(true) - $start);
printf("escapeC: %.3f ms\n", $elapsed);

Result:

PHP7:
escapeA === escapeB: true
escapeA === escapeC: true
escapeA: 559.096 ms
escapeB: 627.883 ms
escapeC: 546.817 ms

PHP5.6:
escapeA === escapeB: true
escapeA === escapeC: true
escapeA: 699.019 ms
escapeB: 1470.605 ms
escapeC: 686.515 ms

It just goes to show that an inefficiently implemented NATIVE algorithm can be slower than an "inefficient" script-level one (calling str_replace multiple times)...

And interestingly, a single "multi-replace" str_replace call with arrays is only marginally faster... and the difference won't be noticeable.

I would probably go with escapeC() for clarity though. But since the difference is tiny, you could keep the existing code instead... 😄
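One caveat if escapeC() is used: the array form of str_replace still applies its pairs one after another, each on the result of the previous one, so the backslash pair has to stay first in both arrays. A minimal sketch with a hypothetical reordered variant (not code from the library):

```php
<?php
// Hypothetical variant of escapeC() with the backslash pair moved to the end.
function escapeCReordered(string $value): string
{
    static $search = ['%', '_', '\\'];
    static $replace = ['\\%', '\\_', '\\\\'];

    // Pairs are applied in array order, so the backslash added for '%'
    // is doubled again by the last pair.
    return str_replace($search, $replace, $value);
}

echo escapeCReordered('100%'), "\n"; // 100\\%  (the correct escape is 100\%)
```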