Regular Expression to validate key length leads to compilation error
10n opened this issue · 11 comments
\Laminas\Cache\Storage\Adapter\Redis was upgraded to have maxKeyLength capability based on the server version. It now indicates 512000000
When using a SimpleCacheDecorator with this kind of adapter the validateKey
method will generate PHP warning because of the way the key length is done.
preg_match(): Compilation failed: number too big in {} quantifier at offset 9").
Running into the same issue:
PHP Warning: preg_match(): Compilation failed: number too big in {} quantifier at offset 9 in /vendor/laminas/laminas-cache/src/Psr/SimpleCache/SimpleCacheDecorator.php on line 338
- laminas/laminas-cache-storage-adapter-redis 1.2.0
- laminas/laminas-cache 2.12.1
Reading the PSR-16 specifications it seems the key should not be bigger then 64, This was previously imposed in validateKey
method
// explicit verification
https://github.com/laminas/laminas-cache/blob/2.11.x/src/Psr/SimpleCache/SimpleCacheDecorator.php
// new capabilities verification
https://github.com/laminas/laminas-cache/blob/2.12.x/src/Psr/SimpleCache/SimpleCacheDecorator.php#L337
The fix that I would suggest is to not allow more than 64 characters even if the storage capabilities allows it.
Another issue that I see is that now Storages that may have capability less then 64 characters as maximum length will trigger an exception "The maximum key length capability must allow at least 64 characters".
If the storage must support at least 64 bytes in key and the PSR-16 indicates that the key must not be more than 64 bytes ... then it does not make any sense to have a property that stores the maximumKeyLength
. If the storage is validated, that means the hard limit is by default 64.
I believe that if there is a maximum key validation that means storages with less then 64 bytes for the maximum key length should be allowed.
I think we need an optimization of the check regarding the key length.
I dont think that regex is the best way to verify a maximum length of a string.
PSR-16 states that at least 64 characters have to be supported. Thats the minimum, not the maximum.
You can read about this here: https://www.php-fig.org/psr/psr-16/
Key - A string of at least one character that uniquely identifies a cached item. Implementing libraries MUST support keys consisting of the characters
A-Z
,a-z
,0-9
,_
, and.
in any order in UTF-8 encoding and a length of up to 64 characters. Implementing libraries MAY support additional characters and encodings or longer lengths, but MUST support at least that minimum. Libraries are responsible for their own escaping of key strings as appropriate, but MUST be able to return the original unmodified key string. The following characters are reserved for future extensions and MUST NOT be supported by implementing libraries:{}()/\@:
There are also integration tests which checks that:
https://github.com/php-cache/integration-tests/blob/824633c9547dc3c7ac5faeca03dd82939cdeb73c/src/SimpleCacheTest.php#L422
https://github.com/php-cache/integration-tests/blob/824633c9547dc3c7ac5faeca03dd82939cdeb73c/src/CachePoolTest.php#L122
Presumably the reason for using regex to get the length of the string was to prevent a dep on mb_string
yet allow unicode characters. You could do something like this instead:
if ($this->maximumKeyLength !== Capabilities::UNLIMITED_KEY_LENGTH
&& count(preg_split('/./u', $key, null)) > $this->maximumKeyLength
) {
This is much slower than preg_match
but at least works. For a key 10000 😄's long and 10000 iterations:
preg_split
is ~3spreg_match
is ~0.001s
Doing a length comparison:
mb_strlen
is ~0.3sstrlen
is ~0.0001s
Fixing this problem for the .1% of users who need massive key sizes will negatively impact the 99.9% majority speed-wise.
Perhaps something like this (not unicode accurate but gives a conservative limit). I'm not aware of a PCRE constant that provides a concrete limit to quantifier limits.
protected const UNICODE_KEY_LENGTH_CHECK_LIMIT = 3000;
if ($this->maximumKeyLength !== Capabilities::UNLIMITED_KEY_LENGTH) {
if (
$this->maximumKeyLength > self::UNICODE_KEY_LENGTH_CHECK_LIMIT
&& strlen($key) > $this->maximumKeyLength
) {
// Exception
} else if (preg_match('/^.{'.($this->maximumKeyLength + 1).',}/u', $key)) {
// Exception
}
}
Another approach would be to remove the check entirely for key lengths over a certain size. Rational being that you require modifying the default memory_limit
and have a niche use-case so are likely technical enough to enforce key length yourself. Also, the underlying adapter would throw an exception should you exceed its limits.
protected const UNICODE_KEY_LENGTH_CHECK_LIMIT = 3000;
if (
$this->maximumKeyLength !== Capabilities::UNLIMITED_KEY_LENGTH
&& strlen($key) < self::UNICODE_KEY_LENGTH_CHECK_LIMIT
&& preg_match('/^.{'.($this->maximumKeyLength + 1).',}/u', $key)
) {
// Exception
}
Okay, I've had some investigations and came to a conclusion:
For the 2.x versions, the redis adapter v1 is in charge.
To fix this problem, we wont be able to totally avoid mb_strlen
but I think I got something which at least will decrease the need for it.
However, due to the fact that this component does not yet require mbstring
as an extension, we cannot rely on it. Using the symfony polyfill was an idea, but it uses iconv
which could be missing as well.
So instead of changing the regex handling here, I will prepare a new minor version of the redis adapter which decreases the maximum key length to 65534
which will ensure that the regex of the decorator won't blow up.
In v2 of the redis adapter, the 512 MB will be re-added.
In v3 of the cache component, mbstring
will be required (see #157) and the following code might work good enough:
// This already exists tho
if ($this->maximumKeyLength === Capabilities::UNLIMITED_KEY_LENGTH) {
return;
}
// a single UTF-8 character may use up to 4 characters, so we do a naive calculation here to verify if that limit is exceeded
// keep in mind that for the redis adapter, a cache key must exceed 128000000 characters (128MB) to not pass this check!!!
$maximumNonUnicodeKeyLength = $this->maximumKeyLength / 4;
if (!isset($key[$maximumNonUnicodeKeyLength + 1])) {
return;
}
// current functionality
$validator = function (string $key): bool {
return ! (bool) preg_match('/^.{'. $this->maximumKeyLength . ',}$/u',$key);
};
// if the key length exceeds the maximum quantifier length, use `mb_strlen`
if ($this->maximumKeyLength > 65535) {
$validator = function (string $key): bool {
return mb_strlen($key, 'UTF-8') <= $this->maximumKeyLength;
};
}
if ($validator($key)) {
return;
}
throw new SimpleCacheInvalidArgumentException(sprintf(
'Invalid key "%s" provided; key is too long. Must be no more than %d characters',
$key,
$this->maximumKeyLength
));
What do you think?
Maybe just use if else
to avoid declaring an anonymous function that is immediately overridden ?
From PSR-16, the enforcement of the minimum key length is already achieved with the current code and doesn't exceed regex limits.
Redis allows for 512 MB for key length, which is a size limit, not a character length. I believe the current implementation (https://github.com/laminas/laminas-cache-storage-adapter-redis/pull/14/files) is incorrect as a unicode 512000000 character string exceeds this.
If the adapters returned a max byte size for the key, you can derive the max character length with multiplication and only need strlen
. This removes any headaches with character encoding and puts the onus on the adapter to report its capabilities. Perhaps a flag to say which calculation its using would be helpful for BC.
The goal of caching is performance and mb_strlen
not only requires slower introspection of the source, but it also carries the baggage of a new extension. By changing to byte limit instead of character length you'll be able to achieve validation while retaining speed.
The check for max limit is maybe out of the remit of the PSR-16 code considering as enforcement hasn't been recommended, although the simple error message is convenient.
@SorX14 good point. Using mb_strlen
on a 4-byte character would lead to a result of 1
.
So maybe not depending on mbstring at all would be a good thing. Maybe using utf8_decode
along with strlen
as suggested by @ghostwriter would be a better approach. But utf8 characters which will consume more than a Single Byte wont be detected either as these are converted to a single ?
as of how I understand the documentation of utf8_decode
.
Maybe we keep the maximum key length of redis at 65534
as it should be long enough anyway. If some one needs longer keys, they should redis directly rather than a library. 🤷🏼♂️