C++ wrapper for PCRE2 library
PCRE2 is the name used for a revised API for the PCRE library, which is a set of functions, written in C, that implement regular expression pattern matching using the same syntax and semantics as Perl, with just a few differences. Some features that appeared in Python and the original PCRE before they appeared in Perl are also available using the Python syntax.
This provides some C++ wrapper classes/functions to perform regex operations such as regex match and regex replace.
You can read the complete documentation here or download it from jpcre2-doc repository.
- PCRE2 library (version >=10.21).
If the required PCRE2 version is not available in the official channel, you can download my fork of the library.
This is a header-only library. All you need to do is include the header jpcre2.hpp in your program.
#include "jpcre2.hpp"
Notes:
- jpcre2.hpp #includes pcre2.h, thus you don't need to include pcre2.h manually in your program.
- If pcre2.h is in a non-standard path, you may include it before jpcre2.hpp with the correct path (you will need to define PCRE2_CODE_UNIT_WIDTH before including pcre2.h in this case).
- There's no need to define PCRE2_CODE_UNIT_WIDTH before including jpcre2.hpp.
- On Windows, if you are working with a static PCRE2 library, you must define PCRE2_STATIC before including jpcre2.hpp (or before pcre2.h if you included it manually).
Install:
You can copy this header to a standard include directory (folder) so that it becomes available from a standard include path.
On Unix you can do:
./configure
make
make install #(may require root privilege)
It will check if all dependencies are satisfied and install the header in a standard include path.
Compile/Build:
Compile/Build your code with the corresponding PCRE2 libraries linked. For 8-bit code unit width you need to link with the 8-bit library, for 16-bit the 16-bit library, and so on. If you want to use multiple code unit widths, link against all of the 8-bit, 16-bit and 32-bit libraries. See code unit width and character type for details.
Example compilation with g++:
g++ main.cpp -lpcre2-8
g++ multi.cpp -lpcre2-8 -lpcre2-16 -lpcre2-32
If PCRE2 is not installed in the standard path, add the path with the -L option:
g++ main.cpp -L/my/library/path -lpcre2-8
Performing a match or replacement against regex pattern involves two steps:
- Compiling the pattern
- Performing the match or replacement operation
Select a character type according to the library you want to use. In this doc we use the 8-bit library as reference, with char as the character type. If char in your system is 16-bit, you will have to link against the 16-bit library instead; the same goes for 32-bit. Other bit sizes are not supported by PCRE2.
Let's use a typedef to shorten the code:
typedef jpcre2::select<char> jp;
// You have to select the basic data type (char, wchar_t, char16_t or char32_t)
(You can use temporary objects too, see short examples).
This object will hold the pattern, options and compiled pattern.
jp::Regex re;
Use a separate object for each regex pattern.
re.setPattern("(?:(?<word>[?.#@:]+)|(?<word>\\w+))\\s*(?<digit>\\d+)") //set pattern
.addModifier("iJ") //add modifier (J for PCRE2_DUPNAMES)
.compile(); //Finally compile it.
//Do not use setModifier() after adding any modifier/s, it will reset them.
//Another way is to use constructor to initialize and compile at the same time:
jp::Regex re2("pattern2","mSi"); //S is an optimization mod.
jp::Regex re3("pattern3", PCRE2_ANCHORED);
jp::Regex re4("pattern4", PCRE2_ANCHORED, jpcre2::JIT_COMPILE);
Now you can perform match or replace against the pattern. Use the RegexMatch::match() function to perform regex match and the RegexReplace::replace() member function to perform regex replace.
You can check whether the regex was compiled successfully, but it's not necessary. A match against a non-compiled regex gives 0 matches, and a replace returns the exact same subject string that you passed.
if(!re) std::cout<<"Failed";
else std::cout<<"Successful";
The if(re) conditional is only available for >= C++11:
if(re) std::cout<<"Success";
else std::cout<<"Failure";
For < C++11, you can use the double bang trick as an alternative to if(re):
if(!!re) std::cout<<"Success";
else std::cout<<"Failure";
Match is generally performed using the jp::RegexMatch::match() function.
For convenience, a shortcut function is available in Regex: jp::Regex::match(). It can take up to three arguments and uses a temporary match object to perform the match.
To get match results, you will need to pass vector pointers that will be filled with match data.
jp::Regex re("\\w+ect");
if(re.match("I am the subject")) //always uses a new temporary match object
std::cout<<"matched (case sensitive)";
else
std::cout<<"Didn't match";
//For case insensitive match, re-compile with modifier 'i'
re.addModifier("i").compile();
if(re.match("I am the subjEct")) //always uses a new temporary match object
std::cout<<"matched (case insensitive)";
else
std::cout<<"Didn't match";
size_t count = jp::Regex("[aijst]","i").match("I am the subject","g"); //always uses a new temporary match object
The g modifier performs a global match.
To get the match results, you need to pass appropriate vector pointers. This is an example of how you can get the numbered substrings/captured groups from a match:
jp::VecNum vec_num;
jp::RegexMatch rm;
size_t count=rm.setRegexObject(&re) //set associated Regex object
.setSubject(&subject) //set subject string
.addModifier(ac_mod) //add modifier
.setNumberedSubstringVector(&vec_num) //pass pointer to VecNum vector
.match(); //Finally perform the match.
//vec_num will be populated with vectors of numbered substrings.
//count is the total number of matches found
You can access a substring/captured group by specifying its index (position):
std::cout<<vec_num[0][0]; // group 0 in first match
std::cout<<vec_num[0][1]; // group 1 in first match
std::cout<<vec_num[1][0]; // group 0 in second match
To get named substrings and/or the name to number mapping, pass pointers to the appropriate vectors with jp::RegexMatch::setNamedSubstringVector() and/or jp::RegexMatch::setNameToNumberMapVector() before doing the match.
jp::VecNum vec_num; ///Vector to store numbered substring vector.
jp::VecNas vec_nas; ///Vector to store named substring Map.
jp::VecNtN vec_ntn; ///Vector to store Named substring to Number Map.
std::string ac_mod="g"; // g is for global match. Equivalent to using setFindAll() or FIND_ALL in addJpcre2Option()
jp::RegexMatch rm;
rm.setRegexObject(&re)
.setSubject(&subject) //set subject string
.addModifier(ac_mod) //add modifier
.setNumberedSubstringVector(&vec_num) //pass pointer to vector of numbered substring vectors
.setNamedSubstringVector(&vec_nas) //pass pointer to vector of named substring maps
.setNameToNumberMapVector(&vec_ntn) //pass pointer to vector of name to number maps
.match(); //Finally perform the match()
std::cout<<vec_nas[0]["name"]; // captured group by name in first match
std::cout<<vec_nas[1]["name"]; // captured group by name in second match
If you need the position (number) of a named captured group, pass a jp::VecNtN pointer to the jp::RegexMatch::setNameToNumberMapVector() function before doing the match (see above).
std::cout<<vec_ntn[0]["name"]; // position of captured group 'name' in first match
You can iterate through the matches for numbered substrings (jp::VecNum) like this:
for(size_t i=0;i<vec_num.size();++i){
//i=0 is the first match found, i=1 is the second and so forth
for(size_t j=0;j<vec_num[i].size();++j){
//j=0 is the capture group 0 i.e the total match
//j=1 is the capture group 1 and so forth.
std::cout<<"\n\t"<<j<<": "<<vec_num[i][j]<<"\n";
}
}
You can iterate through named substrings (jp::VecNas) like this:
for(size_t i=0;i<vec_nas.size();++i){
//i=0 is the first match found, i=1 is the second and so forth
for(jp::MapNas::iterator ent=vec_nas[i].begin();ent!=vec_nas[i].end();++ent){
//ent->first is the name of the captured group
//ent->second is the captured substring itself
std::cout<<"\n\t"<<ent->first<<": "<<ent->second<<"\n";
}
}
If you are using >=C++11, you can make the loop a lot simpler:
for(size_t i=0;i<vec_nas.size();++i){
for(auto const& ent : vec_nas[i]){
std::cout<<"\n\t"<<ent.first<<": "<<ent.second<<"\n";
}
}
jp::VecNtN can be iterated through the same way as jp::VecNas.
Every match object needs to be associated with a Regex object. A match object without an associated Regex object will always give 0 matches.
jp::RegexMatch rm;
rm.setRegexObject(&re);
//Another way is to use constructor
jp::RegexMatch rm1(&re);
size_t count = rm.setSubject("subject")
.setModifier("g")
.match();
The RegexMatch class stores a pointer to its associated Regex object. If the content of the associated Regex object is changed, the change will be reflected in the next operation/result.
Regex replace is generally performed using the jp::RegexReplace::replace() function.
However, a convenience shortcut function is available in the Regex class: jp::Regex::replace(subject, replacewith, modifier). It uses a temporary replace object to perform the replacement.
//Using a temporary regex object
std::cout<<jp::Regex("\\d+").replace("I am digits 1234 0000","5678", "g");
//'g' modifier is for global replacement
//1234 and 0000 get replaced with 5678
jp::RegexReplace rr;
std::cout<<
rr.setRegexObject(&re) //set associated Regex object
.setSubject(&s) //Set various parameters
.setReplaceWith(&s2) //...
.addModifier("gE") //...
.addJpcre2Option(0) //...
.addPcre2Option(0) //...
.replace(); //Finally do the replacement.
//gE is the modifier passed (global and unknown-unset-empty).
//Access substrings/captured groups with ${1234},$1234 (for numbered substrings)
// or ${name} (for named substrings) in the replacement part i.e in setReplaceWith()
Every replace object needs to be associated with a Regex object. A replace object not associated with any Regex object will perform no replacement and return the same subject string that was given.
jp::RegexReplace rr;
rr.setRegexObject(&re);
//Another way is to use constructor
jp::RegexReplace rr1(&re);
rr.setSubject("subjEct")
.setReplaceWith("me")
.setModifier("g")
.replace();
The RegexReplace class stores a pointer to its associated Regex object. If the content of the associated Regex object is changed, the change will be reflected in the next operation/result.
The jp::RegexReplace class has two replace functions: jp::RegexReplace::replace() and jp::RegexReplace::nreplace(). Both of them can take a jp::MatchEvaluator instance as argument and perform the replace operation according to the callback function set in the MatchEvaluator class.
These two are just wrappers of jp::MatchEvaluator::replace() and jp::MatchEvaluator::nreplace(). Using those functions directly, one can re-use existing match data for a new replacement operation without doing the match again. This facility comes with some quirks, though; see the Re-use match data section. By default, all replace functions do a new match every time and re-create the match data.
The first function (replace()) performs PCRE2-compatible replacement: it uses pcre2_substitute to process the replacement string returned by the callback function. The second one (nreplace()) uses a native approach without pcre2_substitute and treats the string returned by the callback function as literal.
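The "literal" behaviour described for nreplace() can be pictured with a small standard-library sketch. It uses std::regex as a stand-in engine (not PCRE2), and replace_literal is a hypothetical helper name, so treat it as an illustration of the idea rather than JPCRE2's implementation:

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <string>

// Illustration only: std::regex stands in for PCRE2 here.
// Every match is replaced by whatever the callback returns, and that
// returned text is copied verbatim (no $1 / ${name} expansion).
inline std::string replace_literal(
        const std::string& subject,
        const std::regex& re,
        const std::function<std::string(const std::smatch&)>& callback) {
    std::string out;
    std::string::const_iterator last = subject.begin();
    for (std::sregex_iterator it(subject.begin(), subject.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);  // text between the previous match and this one
        out += callback(m);            // callback result inserted as-is
        last = m[0].second;            // continue after this match
    }
    out.append(last, subject.end());   // trailing text after the last match
    return out;
}
```

A callback that returns text containing $1 leaves it untouched in the output, which is the difference from a substitution pass that would expand such references.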
The class MatchEvaluator implements several constructor overloads to take different callback functions. Also, there are setter functions which allow changing the callback functions if desired.
The callback function takes exactly three positional arguments. If you don't need one or more arguments, you may pass void* in their respective positions in the argument list.
The callback function:
jp::String callback1(const jp::NumSub& m1, void*, void*){
return "("+m1[0]+")";
}
then,
jp::Regex re("(?<total>\\w+)", "n");
jp::RegexReplace rr;
jp::String s3 = "I am a string 879879 fdsjkll ১ ২ ৩ ৪ অ আ ক খ গ ঘ";
rr.setRegexObject(&re)
.setSubject(&s3)
.setModifier("g");
std::cout<<"Result:\n"<<
rr.nreplace(jp::MatchEvaluator(callback1)); //replace() function can take the same argument
Detailed examples are in the testme.cpp file.
std::cout<<"Result:\n"<<
rr.nreplace(jp::MatchEvaluator
(
[](const jp::NumSub& m1, const jp::MapNas& m2, void*){
return "("+m1[0]+"/"+m2.at("total")+")";
}
));
//replace() function can take the same argument
Replacement can be done with only MatchEvaluator:
std::cout<<"Result:\n"<<
jp::MatchEvaluator(callback1).setSubject(&s3)
.setRegexObject(&re)
.setModifier("g")
.nreplace();
//replace() function can take the same argument
A MatchEvaluator object can be created using one of its many constructors. Callback functions can be provided with the constructors or can be changed later with the jp::MatchEvaluator::setCallback() function. If no callback function is set/given, then the default callback function is jp::callback::erase(), which deletes the matched part/s from the subject string.
jp::MatchEvaluator me; //default callback jp::callback::erase
me.setRegexObject(&re).setSubject(&sub).nreplace(); //this will remove matched parts from sub.
jp::MatchEvaluator me1(callback1); //arbitrary callback function.
jp::MatchEvaluator me2(&re); //default callback jp::callback::erase
me2.setSubject(sub).nreplace(); //this will remove matched parts from sub.
It is possible to use existing match data to perform replacement without performing a new match operation.
Safest way but not the best:
jp::MatchEvaluator me(jp::callback::fill); //this callback implements all vectors and does not modify subject string.
//Now you need to populate the vectors with match data:
me.setSubject(&sub).setRegexObject(&re).match();
//Now that we have all the match data we need, we can use it to perform replacement according to
//different callback functions:
me.setCallback(callback2).nreplace(false); //'false' tells nreplace() to not perform new match.
me.setCallback(callback3).nreplace(false);
//etc..
Best but not the safest:
Instead of creating data for all vectors, you can do it as necessary, but it requires you to be vigilant about what you are doing:
jp::MatchEvaluator me; //no vector with jp::callback::erase callback
me.setSubject(sub).setRegexObject(&re); //no data yet.
Let's say we have a callback cb3 that implements NumSub and MapNas and we do this:
me.setCallback(cb3).nreplace();
//this creates match data for NumSub and MapNas and performs the replacement.
Now, if we want to perform the replacement with a different callback function cb2 which implements only MapNas or NumSub or both, we can re-use the data created above:
me.setCallback(cb2).nreplace(false);
If we want to use a callback function cb4 which implements jp::MapNtN, we cannot re-use the existing data because there is no data for jp::MapNtN yet (it will give an assertion error if we try). Thus we will need to do the match again:
me.setCallback(cb4).nreplace(); //creating data again and performing replacement.
After the above operation, all the vectors are filled with data (the missing jp::MapNtN was created). Consequently, we can use any callback function we want at this stage because we have all the data that we will need.
Thus a callback cb7 that implements all match data vectors can be used without doing the match again:
me.setCallback(cb7).nreplace(false); //OK, as we have all the data we need.
Quirks:
- Changes in replace-related options take effect without a re-match.
- Changes in match-related options (e.g. start offset) need a re-match to take effect.
- To re-use existing match data, the callback function must be compatible with the data; otherwise it's an assertion failure.
- If the associated Regex object or subject string changes, a new match must be performed; trying to use the existing match data in such cases is undefined behavior.
Make sure you at least understand the last two points above before going for a practical implementation of re-using match data. See jpcre2::select::MatchEvaluator for details.
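The idea of matching once and then running several callbacks over the stored data can be sketched as follows. This is a toy model built on std::regex, not the MatchEvaluator implementation; ToyEvaluator and CachedMatch are hypothetical names. It also shows why a changed subject calls for a new match: the cached offsets refer to the subject as it was when match() ran.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <regex>
#include <string>
#include <vector>

// Toy model only: one stored match = offset, length and captured groups.
struct CachedMatch {
    std::size_t pos;
    std::size_t len;
    std::vector<std::string> groups;  // groups[0] is the whole match
};

struct ToyEvaluator {
    std::string subject;
    std::vector<CachedMatch> cache;

    // Performs the (expensive) match step and stores its results.
    void match(const std::regex& re) {
        cache.clear();
        for (std::sregex_iterator it(subject.begin(), subject.end(), re), end;
             it != end; ++it) {
            CachedMatch cm;
            cm.pos = static_cast<std::size_t>(it->position(0));
            cm.len = static_cast<std::size_t>(it->length(0));
            for (std::size_t g = 0; g < it->size(); ++g)
                cm.groups.push_back(it->str(g));
            cache.push_back(cm);
        }
    }

    // Builds the result from the cache alone; no new match is performed.
    // If subject was modified after match(), the offsets are stale.
    std::string nreplace(
            const std::function<std::string(const std::vector<std::string>&)>& cb) const {
        std::string out;
        std::size_t last = 0;
        for (const CachedMatch& cm : cache) {
            out.append(subject, last, cm.pos - last);
            out += cb(cm.groups);
            last = cm.pos + cm.len;
        }
        out.append(subject, last, std::string::npos);
        return out;
    }
};
```

Calling nreplace() twice with different callbacks after a single match() mirrors the nreplace(false) calls shown above.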
JPCRE2 uses a default set of modifiers to provide an easy path to setting different options for different operations. There are three basic operations, namely compile, match and replace, and thus the set is divided into three subsets of modifiers. For convenience, we call them modifier tables.
If the default modifier table is not suitable for your application, you may use a custom modifier table instead. The jpcre2::ModifierTable class provides this interface (note the namespace, it's directly under jpcre2).
All modifier strings are parsed and converted to equivalent PCRE2 and JPCRE2 options on the fly. If you don't want to spend any time parsing modifiers, pass the equivalent options directly with one of the many variants of the addJpcre2Option() and addPcre2Option() functions.
Types of modifiers:
- Compile modifier
- Match modifier
- Replace modifier
All of the modifiers above can be divided further into two categories:
- Unique modifier
- Combined or mixed modifier (e.g 'n', 'E')
These modifiers define the behavior of a regex pattern (they are integrated in the compiled regex). They have more or less the same meaning as the PHP regex modifiers except for e, j and n (marked with *).
Modifier | Details |
---|---|
e * | Unset back-references in the pattern will match to empty strings. Equivalent to PCRE2_MATCH_UNSET_BACKREF. |
i | Case-insensitive. Equivalent to PCRE2_CASELESS option. |
j * | \u \U \x and unset back-references will act as JavaScript standard. Equivalent to PCRE2_ALT_BSUX \| PCRE2_MATCH_UNSET_BACKREF. |
m | Multi-line regex. Equivalent to PCRE2_MULTILINE option. |
n * | Enable Unicode support for \w \d etc... in pattern. Equivalent to PCRE2_UTF \| PCRE2_UCP. |
s | If this modifier is set, a dot meta-character in the pattern matches all characters, including newlines. Equivalent to PCRE2_DOTALL option. |
u | Enable UTF support. Treat pattern and subjects as UTF strings. It is equivalent to PCRE2_UTF option. |
x | Whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, enables commentary in pattern. Equivalent to PCRE2_EXTENDED option. |
A | Match only at the first position. It is equivalent to PCRE2_ANCHORED option. |
D | A dollar meta-character in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. Equivalent to PCRE2_DOLLAR_ENDONLY option. |
J | Allow duplicate names for sub-patterns. Equivalent to PCRE2_DUPNAMES option. |
S | When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching/replacing. It may also be beneficial for a very long subject string or pattern. Equivalent to an extra compilation with JIT_COMPILER with the option PCRE2_JIT_COMPLETE. |
U | This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. Equivalent to PCRE2_UNGREEDY option. |
These modifiers are not compiled in the regex itself, rather they are used per call of each match or replace function.
Modifier | Action | Details |
---|---|---|
A | match | Match at start. Equivalent to PCRE2_ANCHORED. Can be used in match operation. Setting this option only at match time (i.e. regex was not compiled with this option) will disable optimization during match time. |
e | replace | Replaces unset group with empty string. Equivalent to PCRE2_SUBSTITUTE_UNSET_EMPTY. |
E | replace | Extension of e modifier. Sets even unknown groups to empty string. Equivalent to PCRE2_SUBSTITUTE_UNSET_EMPTY \| PCRE2_SUBSTITUTE_UNKNOWN_UNSET. |
g | match, replace | Global. Will perform global matching or replacement if passed. Equivalent to jpcre2::FIND_ALL for match and PCRE2_SUBSTITUTE_GLOBAL for replace. |
x | replace | Extended replacement operation. Equivalent to PCRE2_SUBSTITUTE_EXTENDED. It enables some Bash like features: ${<n>:-<string>} and ${<n>:+<string1>:<string2>}, where <n> may be a group number or a name. The first form specifies a default value. If group <n> is set, its value is inserted; if not, <string> is expanded and the result is inserted. The second form specifies strings that are expanded and inserted when group <n> is set or unset, respectively. The first form is just a convenient shorthand for ${<n>:+${<n>}:<string>}. |
A modifier table is an instance of the jpcre2::ModifierTable class. You can bind this table with any of the compile, match and replace related class objects. Different objects can have different tables.
Examples:
/* ***************************
* Compile modifier table
* ***************************/
//character table is either std::string or const char* (not jp::String)
std::string nametab = "IJMS"; //arbitrary modifier characters.
//now the option values sequentially
jpcre2::Uint valtab[] = { PCRE2_CASELESS, PCRE2_DUPNAMES, PCRE2_MULTILINE, jpcre2::JIT_COMPILE };
//if the above two don't have the same number of elements, the behavior is undefined.
//init ModifierTable
jpcre2::ModifierTable mdt; //creates empty table.
//change the Compile modifier table only:
mdt.setCompileModifierTable(nametab, valtab);
//now bind the table with the object
jp::Regex re;
re.setModifierTable(&mdt);
//let's perform a compile
re.compile("JPCRE2","I"); //now I is PCRE2_CASELESS and small 'i' is an invalid modifier.
For details, see the testmd.cpp file.
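Conceptually, a modifier table pairs each modifier character with an option value, and parsing a modifier string ORs the matching values together. The sketch below models only that idea with the standard library. ToyModifierTable and the OPT_* constants are hypothetical (the values are not the real PCRE2 ones), and the way unknown characters are reported is an assumption made for this example:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical option bits for illustration; NOT the real PCRE2 values.
const std::uint32_t OPT_CASELESS  = 1u << 0;
const std::uint32_t OPT_DUPNAMES  = 1u << 1;
const std::uint32_t OPT_MULTILINE = 1u << 2;

// Toy model of a modifier table: names[i] maps to values[i].
struct ToyModifierTable {
    std::string names;
    std::vector<std::uint32_t> values;

    // ORs together the options for every character in mod.
    // Returns false (leaving out untouched) on the first unknown character.
    bool parse(const std::string& mod, std::uint32_t& out) const {
        std::uint32_t acc = 0;
        for (char c : mod) {
            std::size_t pos = names.find(c);
            if (pos == std::string::npos || pos >= values.size())
                return false;  // unknown modifier, or tables of unequal length
            acc |= values[pos];
        }
        out = acc;
        return true;
    }
};
```

With a table built from "IJM", the string "IM" resolves to OPT_CASELESS | OPT_MULTILINE, while a lowercase "i" is rejected, in the same spirit as the custom table above making small 'i' invalid.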
JPCRE2 allows both PCRE2 and native JPCRE2 options to be passed. PCRE2 options are recognized by the PCRE2 library itself.
These options are meaningful only for the JPCRE2 library, not the original PCRE2 library. We use the jp::Regex::addJpcre2Option() family of functions to pass these options.
Option | Details |
---|---|
jpcre2::NONE | This is the default option. Equivalent to 0 (zero). |
jpcre2::FIND_ALL | This option will do a global match if passed during matching. The same can be achieved by passing the 'g' modifier with jp::RegexMatch::addModifier() function. |
jpcre2::JIT_COMPILE | This is same as passing the S modifier during pattern compilation. |
We use the jp::Regex::addPcre2Option() family of functions to pass the PCRE2 options. These options are the same as in the PCRE2 library and have the same meaning. For example, instead of passing the 'g' modifier to the replacement operation, we can pass its PCRE2 equivalent PCRE2_SUBSTITUTE_GLOBAL to have the same effect. Passing these options directly will be faster than passing modifiers.
This is where deviations from the PCRE2 specification will be laid out.
Details | PCRE2 | JPCRE2 |
---|---|---|
Different name for same group | not supported (10.21) | supported (>=10.30.01) |
The bit size of the character type must match the PCRE2 library you are linking against. There are three PCRE2 libraries according to code unit width, namely the 8, 16 and 32 bit libraries. So, if you use a character type of 8-bit code unit width (e.g. char, which is generally 8 bit), you will have to link your program against the 8-bit PCRE2 library. If it's a 16-bit character type, you will need the 16-bit library. If you use a combination of the supported code unit widths or all of them, you will have to link your program against their corresponding PCRE2 libraries. A missing library will result in a compile-time error.
Implementation defined behavior:
Size of integral types (char, wchar_t, char16_t, char32_t) is implementation defined. char may be 8, 16, 32 or 64 (not supported) bit. Same goes for wchar_t and others. In Linux wchar_t is 32 bit and in Windows it's 16 bit.
JPCRE2 code is portable with regard to code unit width. Your program gets compiled according to the code unit width defined by your system. Consider the following example, where you do:
#include <jpcre2.hpp>
typedef jpcre2::select<char> jp;
int main(){
jp::Regex re;
///other things
// ...
return 0;
}
This is what will happen when you compile:
- In a system where char is 8 bit, it will use 8-bit library and UTF-8 in UTF-mode.
- In a system where char is 16 bit, it will use 16-bit library and UTF-16 in UTF-mode.
- In a system where char is 32 bit, it will use 32-bit library and UTF-32 in UTF-mode.
- In a system where char is not 8 or 16 or 32 bit, it will yield compile error.
If you don't want to keep track of the code unit width of the character type/s you are using, link your program against all PCRE2 libraries. The code unit width will be handled automatically, and if anything unsupported is encountered, you will get a compile-time error.
A common example in this regard can be the use of wchar_t:
jpcre2::select<wchar_t>::Regex re;
- In Windows, the above code will use 16-bit library and UTF-16 in UTF mode.
- In Linux, the above code will use 32-bit library and UTF-32 in UTF mode.
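The mapping from character type to library can be checked on your own system with a few lines of standard C++. This is a minimal sketch, not how JPCRE2 selects the library internally. code_unit_width, is_supported_width and pcre2_library_for are hypothetical helper names, and the returned strings simply follow the -lpcre2-8, -lpcre2-16 and -lpcre2-32 link flags shown earlier:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// Width in bits of one code unit of character type T on this system.
template <typename T>
constexpr std::size_t code_unit_width() {
    return sizeof(T) * CHAR_BIT;
}

// True if the width is one of the three widths PCRE2 supports (8, 16, 32).
template <typename T>
constexpr bool is_supported_width() {
    return code_unit_width<T>() == 8 ||
           code_unit_width<T>() == 16 ||
           code_unit_width<T>() == 32;
}

// Name of the PCRE2 library to link for character type T (hypothetical helper).
template <typename T>
const char* pcre2_library_for() {
    switch (code_unit_width<T>()) {
        case 8:  return "pcre2-8";
        case 16: return "pcre2-16";
        case 32: return "pcre2-32";
        default: return "unsupported";
    }
}
```

Printing pcre2_library_for<wchar_t>() on the target platform shows which library a jpcre2::select<wchar_t> program would need there.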
For portable code, instead of using the standard names std::string or such, use jp::String (you may further typedef it as String or whatever). It will be defined to an appropriate string class according to the basic character type you selected and thus provide all the functionalities and conveniences you get with std::string and such string classes. That being said, there's no harm if you use the standard names (std::string etc.). Using jp::String will just ensure that you are using the correct string class for the character type you selected. If you need to use the basic character type, use jp::Char.
Instead of using full names like std::vector<std::string> and such for storing match results, use the typedefs:
- jp::NumSub: Equivalent to std::vector<jp::String>
- jp::MapNas: Equivalent to std::map<jp::String, jp::String> (You can set arbitrary map (e.g. std::unordered_map) instead of std::map when using >=C++11)
- jp::MapNtN: Equivalent to std::map<jp::String, size_t> (You can set arbitrary map (e.g. std::unordered_map) instead of std::map when using >=C++11)
- jp::VecNum: Equivalent to std::vector<jp::NumSub>
- jp::VecNas: Equivalent to std::vector<jp::MapNas>
- jp::VecNtN: Equivalent to std::vector<jp::MapNtN>
- jpcre2::VecOff: Equivalent to std::vector<size_t> (note the namespace, it's directly under jpcre2)
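To picture how these containers nest, here is a standard-library sketch that mirrors the equivalences listed above for the char case. The unqualified typedef names and the sample_* functions are stand-ins written for this illustration, and the data is hand-written to resemble a global match of a hypothetical pattern; nothing here calls JPCRE2:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Standard-library stand-ins for the JPCRE2 typedefs, assuming the basic
// character type is char. Real code should use the jp:: names instead.
typedef std::vector<std::string>            NumSub;  // cf. jp::NumSub
typedef std::map<std::string, std::string>  MapNas;  // cf. jp::MapNas
typedef std::map<std::string, std::size_t>  MapNtN;  // cf. jp::MapNtN
typedef std::vector<NumSub>                 VecNum;  // cf. jp::VecNum
typedef std::vector<MapNas>                 VecNas;  // cf. jp::VecNas
typedef std::vector<MapNtN>                 VecNtN;  // cf. jp::VecNtN

// Hand-written data shaped like two matches of a hypothetical pattern with
// the named groups "word" (group 1) and "digit" (group 2) on "abc 12 de 345".
inline VecNum sample_numbered() {
    return VecNum{ NumSub{"abc 12", "abc", "12"},
                   NumSub{"de 345", "de", "345"} };
}
inline VecNas sample_named() {
    return VecNas{ MapNas{{"word", "abc"}, {"digit", "12"}},
                   MapNas{{"word", "de"},  {"digit", "345"}} };
}
inline VecNtN sample_positions() {
    return VecNtN{ MapNtN{{"word", 1}, {"digit", 2}},
                   MapNtN{{"word", 1}, {"digit", 2}} };
}
```

The outer vector index selects the match, and the inner index or key selects the group within that match, e.g. sample_numbered()[1][2] is group 2 of the second match.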
Other typedefs are mostly for internal use:
- You should not use jpcre2::Ush as unsigned short. In JPCRE2 context, it is the smallest unsigned integer type to cover at least the numbers from 1 to 126.
- jpcre2::Uint is a fixed width unsigned integer type and will be at least 32 bit wide.
- jpcre2::SIZE_T is the same as PCRE2_SIZE, which is defined as size_t.
- jpcre2::VecOpt is defined as std::vector<jpcre2::Uint>.
When a known error occurs during pattern compilation, match or replace, the error number and error offset are set to the corresponding variables of the respective classes. You can get the error number, error offset and error message with the getErrorNumber(), getErrorOffset() and getErrorMessage() functions respectively. These functions are available for all three classes.
Note that each new error overwrites the previous one, so you only get the last error that occurred.
Also note that these errors never get re-initialized (set to zero) automatically; they remain even when a later operation works fine (until another error overwrites them).
If you experiment with various erroneous situations, make use of the resetErrors() function. You can call it from anywhere in your method chain to immediately set the errors to zero. This function is also defined for all three classes to reset their corresponding errors.
JPCRE2 asserts some errors with descriptive error messages. These errors are mistakes in your code and must not be shipped to the client without fixing.
In no situation should these errors be bypassed by a #define NDEBUG before including jpcre2.hpp. You should investigate the error message and fix the cause.
When there are no such errors in your finalized code, you may use #define NDEBUG to strip out these assertions.
JPCRE2 treats null as valid input, and its usage has well-defined behavior throughout the JPCRE2 interface. Most of the time a null is treated as 'set something to its initial or empty state'. The initial state doesn't necessarily have to be an empty state, and an empty state doesn't necessarily have to be an initial state. It depends on what you are working with; refer to the doc when you are in a bind.
As an example, if null is passed with setSubject(), then the subject is set to its initial state, which is empty (not null).
Another example: when a null is passed to the setRegexObject() function, it literally sets the Regex object to null, which is the initial state for that calling object.
Giving a null to a std::string (and such) constructor is undefined behavior, but you don't need to worry about it with JPCRE2. If it's too much to type two double quotes ("") to pass an empty string to a JPCRE2 function, you can just use 0; it's perfectly fine. It is a bad practice, though, so treat this guarantee as a safety measure only.
Note: JPCRE2 is supposed to be completely null safe, i.e no undefined behavior for null input. So, if you find any loophole or bug that makes this statement invalid, please report it.
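For comparison, the usual way to get this kind of null safety in plain C++ is to test the pointer before constructing a string from it. The helper below is a hypothetical illustration of that pattern and is not taken from the JPCRE2 source:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds a std::string from a C string that may be null.
// Passing a null pointer straight to the std::string constructor is undefined
// behavior, so null is mapped to the empty string instead.
inline std::string to_string_null_safe(const char* s) {
    return s ? std::string(s) : std::string();
}
```

With such a guard, a null argument and an empty string argument end up in the same well-defined state.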
(C) MT safe: All functions in JPCRE2 library are MT safe provided that the instances calling those functions are themselves thread safe.
When we say '(C) MT safe' or simply 'thread safe' throughout this doc, we mean the above definition of Conditional Multi-Thread safety.
- There is no data race between two separate objects (Regex, RegexMatch, RegexReplace etc.) because the classes do not contain any static variables.
- Temporary class objects are always thread safe.
- A temporary class object that uses another third party object reference or pointer is thread safe provided that the access to the third party object is thread safe.
- Simultaneous access of the same object is MT unsafe. You can use a mutex lock or other mechanisms to ensure thread safety.
Examples:
The following function is thread safe:
typedef jpcre2::select<char> jp;
void* thread_safe_fun1(void* arg){//uses no global or static variable, thus thread safe.
jp::Regex re("\\w", "g");
jp::RegexMatch rm(&re); //It's a local variable
rm.setSubject("fdsf").setModifier("g").match();
return 0;
}
The following function is thread safe for joined thread only:
jp::Regex rec("\\w", "g"); //thread unsafe.
void *thread_pseudo_safe_fun1(void *arg){
//uses global variable 'rec', but uses
//mutex lock, thus thread safe when the thread is joined with the main thread.
//But, when thread is detached from the main thread, it won't be thread safe any more,
//because, the main thread can destroy the rec object while possibly being used by the detached child thread.
pthread_mutex_lock( &mtx );
jp::RegexMatch rm(&rec);
rm.setSubject("fdsf").setModifier("g").match();
pthread_mutex_unlock( &mtx);
return 0;
}
Example multi-threaded programs are provided in src/test_pthread.cpp and src/teststdthread.cpp. The thread safety of these programs is tested with Valgrind (drd tool). See Test suite for more details on the test.
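The same locking pattern can be written with std::thread and std::mutex. The sketch below is standard-library only: SharedCounter is a hypothetical stand-in for any object that must not be accessed simultaneously (the role played by the global rec above), and run_locked is an illustrative helper name. Every thread is joined before the shared object goes out of scope, which is the condition the pthread example depends on.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for a shared object that is unsafe under simultaneous access.
struct SharedCounter {
    long hits = 0;
};

// Starts `threads` workers that each touch the shared object `iterations`
// times under a mutex, joins them all, and returns the final count.
inline long run_locked(int threads, int iterations) {
    SharedCounter shared;
    std::mutex mtx;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&shared, &mtx, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<std::mutex> lock(mtx);  // one thread at a time
                ++shared.hits;
            }
        });
    }
    for (std::thread& th : pool)
        th.join();  // joined before `shared` is destroyed
    return shared.hits;
}
```

On some toolchains, programs using std::thread need to be built with the -pthread flag.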
- To use JPCRE2 in its full capability (including >=C++11 features), use latest compilers with full C++11 support.
- If you do not use >=C++11, you will be OK with older compilers.
Examples and test programs are available in src/test*.cpp files.
File | Containing examples |
---|---|
test0.cpp | Handling std::string and std::wstring. |
test16.cpp | Performing regex match and regex replace with std::wstring and std::u16string. |
test32.cpp | Performing regex match and regex replace with std::wstring and std::u32string. |
test_match.cpp | Performing regex match against a pattern and getting the match count and match results. Shows how to iterate over the match results to get the captured groups/substrings. |
test_match2.cpp | Contains an example to take subject string, pattern and modifier from user input and perform regex match using JPCRE2. |
testmd.cpp | Examples of working with modifier table. |
testme.cpp | Examples of using MatchEvaluator to perform replace. |
test_replace.cpp | Example of doing regex replace. |
test_replace2.cpp | Contains an example to take subject string, replacement string, modifier and pattern from user input and perform regex replace with JPCRE2. |
test_pthread.cpp | Multi threaded examples with POSIX pthread. |
teststdthread.cpp | Multi threaded examples with std::thread. |
test_shorts.cpp | Contains some short examples. |
size_t count;
//Check if string matches the pattern
/*
* The following uses a temporary Regex object.
*/
if(jp::Regex("(\\d)|(\\w)").match("I am the subject"))
std::cout<<"\nmatched";
/*
* Using the modifier S (i.e jpcre2::JIT_COMPILE) with temporary object may or may not give you
* any performance boost (depends on the complexity of the pattern). The more complex
* the pattern gets, the more sense the S modifier makes.
*/
//If you want to match all and get the match count, use the action modifier 'g':
std::cout<<"\n"<<
jp::Regex("(\\d)|(\\w)","m").match("I am the subject","g");
/*
* Modifiers passed to the Regex constructor or with compile() function are compile modifiers
* Modifiers passed with the match() or replace() functions are action modifiers
*/
// Substrings/Captured groups:
/*
* *** Getting captured groups/substring ***
*
* captured groups or substrings are stored in maps/vectors for each match,
* and each match is stored in a vector.
* Thus captured groups are in a vector of maps/vectors.
*
* PCRE2 provides two types of substrings:
* 1. numbered (indexed) substring
* 2. named substring
*
* For the above two, we have two vectors respectively:
* 1. jp::VecNum (Corresponding vector: jp::NumSub)
* 2. jp::VecNas (Corresponding map: jp::MapNas)
*
* An additional vector is available to get the substring position/number
* for a particular captured group by name. It's a vector of name to number maps:
* * jp::VecNtN (Corresponding map: jp::MapNtN)
*/
// ***** Get numbered substring ***** ///
jp::VecNum vec_num;
jp::RegexMatch rm;
jp::Regex re("(\\w+)\\s*(\\d+)","m");
count =
jp::RegexMatch(&re).setSubject("I am 23, I am digits 10")
.setModifier("g")
.setNumberedSubstringVector(&vec_num)
.match();
/*
* count (the return value) is guaranteed to give you the correct number of matches,
* while vec_num.size() may give you a wrong result if any match result
* failed to be inserted in the vector. This should not happen,
* i.e. count and vec_num.size() should always be equal.
*/
std::cout<<"\nNumber of matches: "<<count/* or vec_num.size()*/;
//Now vec_num is populated with numbered substrings for each match
//The size of vec_num is the total match count
//vec_num[0] is the first match
//The type of vec_num[0] is jp::NumSub
std::cout<<"\nTotal match of first match: "<<vec_num[0][0];
std::cout<<"\nCaptured group 1 of first match: "<<vec_num[0][1];
std::cout<<"\nCaptured group 2 of first match: "<<vec_num[0][2];
//captured group 3 doesn't exist; accessing it with operator [] is undefined behavior (typically a segfault)
//std::cout<<"\nCaptured group 3 of first match: "<<vec_num[0][3];
//Using at() will throw std::out_of_range exception
//~ try {
//~ std::cout<<"\nCaptured group 3 of first match: "<<vec_num[0].at(3);
//~ } catch (const std::out_of_range& e) {
//~ std::cerr<<"\n"<<e.what();
//~ }
//There were two matches found (vec_num.size() == 2) in the above example
std::cout<<"\nTotal match of second match: "<<vec_num[1][0]; //Total match (group 0) from second match
std::cout<<"\nCaptured group 1 of second match: "<<vec_num[1][1]; //captured group 1 from second match
std::cout<<"\nCaptured group 2 of second match: "<<vec_num[1][2]; //captured group 2 from second match
// ***** Get named substring ***** //
jp::VecNas vec_nas;
jp::VecNtN vec_ntn; // We will get name to number map vector too
re.compile("(?<word>\\w+)\\s*(?<digit>\\d+)","m");
count =
jp::RegexMatch(&re).setSubject("I am 23, I am digits 10")
.setModifier("g")
//.setNumberedSubstringVector(&vec_num) // We don't need it in this example
.setNamedSubstringVector(&vec_nas)
.setNameToNumberMapVector(&vec_ntn) // Additional (name to number maps)
.match();
std::cout<<"\nNumber of matches: "<<vec_nas.size()/* or count */;
//Now vec_nas is populated with named substrings for each match
//The size of vec_nas is the total match count
//vec_nas[0] is the first match
//The type of vec_nas[0] is jp::MapNas
std::cout<<"\nCaptured group (word) of first match: "<<vec_nas[0]["word"];
std::cout<<"\nCaptured group (digit) of first match: "<<vec_nas[0]["digit"];
//Trying to access a non-existent named substring with the [] operator will give you an empty string.
//If the existence of a substring is important, use the std::map::find() or std::map::at()
//(>=C++11) function to access map elements.
/* //>=C++11
try{
///This will throw exception because the substring name 'name' doesn't exist
std::cout<<"\nCaptured group (name) of first match: "<<vec_nas[0].at("name");
} catch(const std::logic_error& e){
std::cerr<<"\nCaptured group (name) doesn't exist";
}*/
//There were two matches found (vec_nas.size() == 2) in the above example
std::cout<<"\nCaptured group (word) of second match: "<<vec_nas[1]["word"];
std::cout<<"\nCaptured group (digit) of second match: "<<vec_nas[1]["digit"];
//Get the position (number) of a captured group name (that was found in match)
std::cout<<"\nPosition of captured group (word) in first match: "<<vec_ntn[0]["word"];
std::cout<<"\nPosition of captured group (digit) in first match: "<<vec_ntn[0]["digit"];
/*
* Replacement Examples
* Replace pattern in a string with a replacement string
*
* The Regex::replace() function can take a subject and replacement string as argument.
*
* You can also pass the subject with setSubject() function in method chain,
* replacement string with setReplaceWith() function in method chain, etc ...
* A call to RegexReplace::replace() in the method chain will return the resultant string
*/
std::cout<<"\n"<<
//replace first occurrence of a digit with @
jp::Regex("\\d").replace("I am the subject string 44", "@");
std::cout<<"\n"<<
//replace all occurrences of a digit with @
jp::Regex("\\d").replace("I am the subject string 44", "@", "g");
//swap two parts of a string
std::cout<<"\n"<<
jp::Regex("^([^\t]+)\t([^\t]+)$")
.replace("I am the subject\tTo be swapped according to tab", "$2 $1");
//Doing the above with method chain:
re.compile("^([^\t]+)\t([^\t]+)$");
jp::RegexReplace(&re).setSubject("I am the subject\tTo be swapped according to tab")
.setReplaceWith("$2 $1")
.replace();
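The comments in the match examples above warn that `operator[]` on a missing numbered group is unsafe, and that `operator[]` on a missing named group silently yields an empty string. The following is a minimal sketch of checked lookups that also works before C++11. The three typedefs are assumed stand-ins for `jp::NumSub`, `jp::VecNum` and `jp::MapNas` built from standard containers, and `get_numbered_group` and `get_named_group` are hypothetical helper names, not part of JPCRE2.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Assumed stand-ins for jp::NumSub, jp::VecNum and jp::MapNas.
typedef std::vector<std::string> NumSub;
typedef std::vector<NumSub> VecNum;
typedef std::map<std::string, std::string> MapNas;

// Copies the requested group into 'out' and returns true only if both the
// match index and the group index exist; out-of-range elements are never read.
bool get_numbered_group(const VecNum& vec, std::size_t match,
                        std::size_t group, std::string& out) {
    if (match >= vec.size() || group >= vec[match].size()) return false;
    out = vec[match][group];
    return true;
}

// Looks the name up with find(), so a missing name is reported to the caller
// instead of being inserted as an empty string the way operator[] would do.
bool get_named_group(const MapNas& groups, const std::string& name,
                     std::string& out) {
    MapNas::const_iterator it = groups.find(name);
    if (it == groups.end()) return false;
    out = it->second;
    return true;
}
```

With helpers like these, a call such as `get_numbered_group(vec_num, 0, 3, out)` simply returns false when group 3 does not exist, and the map passed to `get_named_group` is left unchanged by a failed lookup.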
- jpcre2::select no longer permits explicit bit size specification.
- ConvInt, ConvUTF (including Convert16 and Convert32) are removed.
- jpcre2::select can take an optional second template parameter to specify the map container.
- Macros JPCRE2_DISABLE_CHAR1632 and JPCRE2_DISABLE_CODE_UNIT_WIDTH_VALIDATION are removed.
For the complete list of changes, see the changelog file.
Some test programs are written to check for major flaws like segfault, memory leak and crucial input/output validation. Before trying to run the tests, make sure you have all 3 PCRE2 libraries installed on your system.
For the simplest (minimal) test, run:
#You can add --enable-cpp11 to test cpp11 features.
./configure --enable-test
make check
To check with valgrind, run:
#requires valgrind to be installed on the system
#You can add --enable-cpp11 to test cpp11 features.
./configure --enable-valgrind
make check
To check the multi threaded examples with drd, run:
#requires valgrind to be installed on the system
#You can add --enable-cpp11 to test cpp11 features.
./configure --enable-thread-check
make check
To prepare a coverage report, run:
#requires lcov and genhtml to be installed on the system
#enable cpp11 to cover cpp11 codes.
#clean any previous make
make distclean #ignore errors
./configure --enable-coverage --enable-cpp11
make coverage
The configure script generated by autotools checks for the availability of several programs and lets you set several options to control your testing environment. These are the options supported by the configure script:
Option | Details |
---|---|
--[enable/disable]-test | Enable/Disable test suite. |
--[enable/disable]-cpp11 | Enable/Disable building tests with C++11 features. |
--[enable/disable]-valgrind | Enable/Disable valgrind test (memory leak test). |
--[enable/disable]-thread-check | Enable/Disable thread check on multi threaded examples. |
--[enable/disable]-coverage | Enable/Disable coverage report. |
--[enable/disable]-silent-rules | Enable/Disable silent rules (enabled by default). You will get prettified make output if enabled. |
Please do all pull requests against the master branch. The default branch is 'release' which is not where continuous development of JPCRE2 is done.
If you find any error in the documentation, a confusing or misleading use of terms, or anything that catches your eye and does not feel right, please open an issue on the issue page. If you want to fix it yourself and send a pull request, use the master branch.
This page is generated from doxy/doxydoc.md file, thus changing the README.md file will have no impact.
This project comes with a BSD LICENCE, see the LICENCE file for more details.
You do not have to tell me which project you are using this library in, but I would very much appreciate it if you let me know the name (and a short description, if applicable) of the project. So if you have the time, please send me an email.