pcre2_regexec is not thread-safe
ChezBunch opened this issue · 10 comments
Traditionally regexec
is thread-safe, meaning once a given regex is compiled into a regex_t
, it can be used simultaneously in different threads.
But pcre2_regexec
is not thread-safe: since regex_t
contains a pcre2_match_data
and that object is modified during matching, two threads cannot execute the same regex_t
at the same time.
We were bitten hard by this limitation in a heavily threaded program (various regexs failing unexpectedly and even some SIGSEGV), since we did not expect such a behavior:
- The pcre2posix documentation is silent on the matter
- The libc version used previously was thread-safe
pcre2_regexec
expecting a pointer to aconst regex_t *
seems to imply that the POSIX API is thread-safe regarding to a compiled regex- The native API is thread-safe regarding to the matching
Should the POSIX layer be modified to be thread-safe? Should the pcre2posix documentation be updated?
In the meantime, we modified our pcre2_regexec
version to allocate and free a pcre2_match_data
object on each invocation, but this means some performances degradation. Another option would be to embed a mutex in the regex_t
object.
Oh dear. The POSIX wrapper was always an "added extra" for PCRE, trying to map the POSIX API onto PCRE's native one. I suppose (I can't remember) that attaching a match_data object to the compiled regex seemed like a good way of saving a malloc/free for each match, and it was probably me that didn't think about multithreading. I do not like the idea of using a mutex because all the PCRE2 code should be simple Standard C for use in many environments. Allocating and freeing match_data obviously works, and I cannot see any other way of making this thread safe, My inclination is to document the fact that it is not thread safe, and encourage multithreading applications to use the native API, but what does anybody else think?
any other way of making this thread safe
something I recommended before for multithreaded use was to allocate the match data as a thread local and reuse it on each thread. The native API makes that easy to do because the match data is a parameter for the match function, so maybe providing a way to give that to the "POSIX" API might help as well.
I might be wrong, but AFAIK regex_t
is an opaque type so that might be feasible without changing the API, although it won't be backward compatible and would be awkward to use, since the match data needs to match the number of pairs that the expression needs, or at least have more available that what the matching function will use.
use the native API,
isn't this problem also present in there though, because of the heapframes for the interpreter since 10.41 if the match_data is shared?
I think another "gotcha" that is also undocumented is the fact that our regoff_t
is an int
(probably because it was what GNU did, and because pcre1 also used that for its offset), pcre2 uses PCRE2_SIZE (aka size_t
) which would be a better fit for POSIX 1003.1-2008, but we have no way to change that as of now, and therefore the wrapper is not really compatible with that version of the standard AFAIK.
The documentation now states that the POSIX functions are not thread-safe, so I am closing this issue.
While I agree this is "closed", I think the current "solution" is suboptimal.
I assume there might be some people out there that would rather have this interface to be really POSIX compatible (including of course being thread safe) to use it, or NOT having the option to use it, since apparently "assuming" it should be POSIX compatible just because of the name seems to be common.
I think I am beginning to regret having introduced this wrapper.
regret
maybe for pcre3 we can remove it, and make it an independent library that uses PCRE, then it can be made to be C11 or higher and therefore making it POSIX compatible would be very simple.
Indeed, it could even had some backward compatibility layer all the way to C89, GNU89 or POSIX if needed.
I am sure someone might even argue it is a fun way to learn some rust ;)
maybe for pcre3
Pre-emptively, can I strongly suggest never releasing anything called pcre3 and skipping straight to pcre4?
For historical reasons relating to sonames, old-pcre ended up being called pcre3 in Debian (and thus all Debian's derivatives, including Ubuntu), so re-using that name would cause no end of confusion.
in Debian
wasn't Debian migrating every single PCRE package to use PCRE2?, hopefully by the time we prepare the first PCRE3 release that should be completed, and therefore no risk for confusion.
BTW, something that might be also a good idea for that version, might be adding version numbers to the functions and so the "need" of add a version number to the library name could be avoided alltogether.
curious though why "3" was selected for PCRE1 and "2" for PCRE2, that might be the biggest source of confusion IMHO.
One thing is certain: it won't be me that puts out pcre3.
in Debian
wasn't Debian migrating every single PCRE package to use PCRE2?, hopefully by the time we prepare the first PCRE3 release that should be completed, and therefore no risk for confusion.
We are, but it's proving quite a slow process :( And also, pcre3 will remain in the archives until at least the last Debian release it was in is older than oldoldstable (i.e. quite a long time!)
curious though why "3" was selected for PCRE1 and "2" for PCRE2, that might be the biggest source of confusion IMHO.
Because the then package maintainer was putting the soname in the package name (like we still do for e.g. libc6), and we'd got to 3 before pcre2 was released.