JPCRE2 is a C++ wrapper for the PCRE2 library that provides utilities such as regex match and regex replace.
#Requirements:
- PCRE2 library (`version >=10.21`).
- C++ compiler with C++11 support.
If the required PCRE2 version is not available in the official channel, download my fork of the library from here, or use this repository, which will always be kept compatible with JPCRE2.
#Install/Include:
It can be installed as a separate library or can be used directly in a project by including the appropriate sources:
- jpcre2.h
- jpcre2.cpp
- jpcre2_match.cpp
- jpcre2_replace.cpp
An example compile/build command with GCC would be:

    g++ -std=c++11 mycpp.cpp jpcre2_match.cpp jpcre2_replace.cpp jpcre2.cpp -lpcre2-8

If your PCRE2 library is not in the standard library path, then add the path:

    g++ -std=c++11 mycpp.cpp ... -L/path/to/your/pcre2/library -lpcre2-8

Note that this requires the PCRE2 library to be installed on your system. If it is not installed where your compiler looks by default, you will need to pass the appropriate paths and link options.
Installing JPCRE2 as a library:

To install it on a Unix-based system, run:

    ./configure
    make
    sudo make install

Now to use it:
- Add `#include <jpcre2.h>` in your code.
- Build/compile by linking with the JPCRE2 and PCRE2 libraries.

An example command for GCC would be:

    g++ -std=c++11 mycpp.cpp -ljpcre2-8 -lpcre2-8 #the order of the libraries is important

If you are on a non-Unix system (e.g. Windows), build a library from the JPCRE2 sources with your favourite IDE or use the sources directly.
Note:
- PCRE2_CODE_UNIT_WIDTH other than 8 is not supported in this version.
- To use the PCRE2 POSIX compatible library, add `-lpcre2-posix` along with the others.
#How to code with JPCRE2:
- First create a `jpcre2::Regex` object. This object will hold the pattern, modifiers, compiled pattern, error and warning codes. Use one object per regex pattern.

        jpcre2::Regex re; //Create object

- Compile the pattern and catch any error exception:

        try{
            re.compile()              //Invoke the compile() function
              .pattern(pat)           //set various parameters
              .modifiers("Jin")       //...
              .jpcre2Options(0)       //...
              .pcre2Options(0)        //...
              .execute();             //Finally execute it.

            //Another way is to use the constructor to initialize and compile at the same time:
            jpcre2::Regex re2("pattern2","mSi");  //S is an optimization mod.
        }
        catch(int e){
            /*Handle error*/
            std::cout<<re.getErrorMessage(e)<<std::endl;
        }

- Now you can perform match or replace against the pattern. Use the `match()` member function to perform regex match and the `replace()` member function to perform regex replace.
- Match: The `match()` member function can take an optional argument (subject) and returns an object of the class RegexMatch, which in turn can be used to pass various parameters using the available member functions (method chaining) of the RegexMatch class. The last function in the method chain should always be the `execute()` function, which returns the result (number of matches found).
- Perform match and catch any error exception:

        jpcre2::VecNum vec_num;
        try{
            size_t count=re.match(subject)                          //Invoke the match() function
                           .modifiers(ac_mod)                       //Set various options
                           .numberedSubstringVector(vec_num)        //...
                           .jpcre2Options(jpcre2::VALIDATE_MODIFIER) //...
                           .execute();                              //Finally execute it.
            //vec_num will be populated with maps of numbered substrings.
            //count is the total number of matches found
        }
        catch(int e){
            /*Handle error*/
            std::cout<<re.getErrorMessage(e)<<std::endl;
        }

  Access the substrings like this:

        for(size_t i=0;i<vec_num.size();i++){
            //This loop will iterate only once if find_all is false.
            //i=0 is the first match found, i=1 is the second and so forth
            for(auto const& ent : vec_num[i]){
                //ent.first is the number/position of substring found
                //ent.second is the substring itself
                //when ent.first is 0, ent.second is the total match.
            }
        }

- To get named substrings or name to number mapping, simply pass the appropriate vectors with `numberedSubstringVector()` and/or `namedSubstringVector()` and/or `nameToNumberMapVector()`:

        jpcre2::VecNum vec_num;   ///Vector to store numbered substring Map.
        jpcre2::VecNas vec_nas;   ///Vector to store named substring Map.
        jpcre2::VecNtN vec_ntn;   ///Vector to store Named substring to Number Map.
        std::string ac_mod="g";   // g is for global match. Equivalent to using findAll() or FIND_ALL in jpcre2Options()
        try{
            re.match(subject)                            //Invoke the match() function
              .modifiers(ac_mod)                         //Set various options
              .numberedSubstringVector(vec_num)          //...
              .namedSubstringVector(vec_nas)             //...
              .nameToNumberMapVector(vec_ntn)            //...
              .jpcre2Options(jpcre2::VALIDATE_MODIFIER)  //...
              .pcre2Options(PCRE2_ANCHORED)              //...
              .execute();                                //Finally execute it.
        }
        catch(int e){
            /*Handle error*/
            std::cout<<re.getErrorMessage(e)<<std::endl;
        }

  Access the substrings by looping through the vectors and their associated maps. All three vectors have the same size and can be accessed in the same way.
- Replace: The `replace()` member function can take up to two optional arguments (subject and replacement string) and returns an object of the class RegexReplace, which in turn can be used to pass various parameters using the available member functions (method chaining) of the RegexReplace class. The last function in the method chain should always be the `execute()` function, which returns the result (replaced string).
- Perform replace and catch any error exception:

        try{
            std::cout<<
            re.replace()          //Invoke the replace() function
              .subject(s)         //Set various parameters
              .replaceWith(s2)    //...
              .modifiers("gE")    //...
              .jpcre2Options(0)   //...
              .pcre2Options(0)    //...
              .execute();         //Finally execute it.
            //gE is the modifier passed (global and unknown-unset-empty).
            //Access substrings/captured groups with ${1234},$1234 (for numbered substrings)
            // or ${name} (for named substrings) in the replacement part i.e in replaceWith()
        }
        catch(int e){
            /*Handle error*/
            std::cout<<re.getErrorMessage(e)<<std::endl;
        }

- If you pass the size of the resultant string with the `bufferSize()` function, make sure it is large enough to store the whole replaced string. Otherwise the internal replace function (`pcre2_substitute()`) will be called twice: the second call uses an adjusted buffer size that can hold the whole result, which avoids the `PCRE2_ERROR_NOMEMORY` error.
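
The buffer behaviour described in the last item can be pictured with a small stand-alone sketch. It does not call PCRE2 or JPCRE2: `writeResult()` is a hypothetical stand-in for a C-style function that reports the required size when the buffer is too small, and `ERR_NOMEMORY` is an arbitrary value chosen for this example. The point is only the "try once, retry with the reported size" pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Arbitrary error value for this sketch (not the PCRE2 constant).
const int ERR_NOMEMORY = -1;

// Hypothetical stand-in for a C-style API that writes into a caller buffer.
// On success: returns 0 and sets *len to the number of characters written.
// On overflow: returns ERR_NOMEMORY and sets *len to the required capacity
// (including the terminating zero).
int writeResult(const std::string& result, char* buf, std::size_t* len) {
    const std::size_t needed = result.size() + 1;
    if (*len < needed) {
        *len = needed;
        return ERR_NOMEMORY;
    }
    std::memcpy(buf, result.c_str(), needed);
    *len = result.size();
    return 0;
}

// Tries the caller's size guess first and retries once with the reported
// size if the guess was too small. `calls` receives the number of attempts.
std::string fetchResult(const std::string& result, std::size_t guess, int& calls) {
    std::string buf(guess, '\0');
    std::size_t len = buf.size();
    calls = 1;
    int rc = writeResult(result, &buf[0], &len);
    if (rc == ERR_NOMEMORY) {
        buf.assign(len, '\0');   // grow to the size reported by the first call
        ++calls;
        rc = writeResult(result, &buf[0], &len);
    }
    if (rc != 0) throw rc;       // int exceptions, in the style used above
    buf.resize(len);
    return buf;
}
```

With a guess of 4 for an 11-character result, `fetchResult()` needs two calls; with a guess of 64 it needs one. A generous size estimate therefore saves the second pass.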
#Insight:
Let's take a quick look at what's inside and how things work here:
###Namespaces:
- jpcre2_utils : Some utility functions used by JPCRE2.
- jpcre2 : This is the namespace you will be using in your code to access JPCRE2 classes and functions.
###Classes:
- Regex : This is the main class which holds the key utilities of JPCRE2. Every regex needs an object of this class.
- RegexMatch: This is the class that holds all the useful functions to perform regex match according to the compiled pattern.
- RegexReplace: This is the class that holds all the useful functions to perform replacement according to the compiled pattern.
###Functions at a glance:
//Class Regex
String getModifier()
String getPattern()
String getLocale() ///Gets LC_CTYPE
uint32_t getCompileOpts() ///Returns the compile opts used for compilation
///Error handling
String getErrorMessage(int err_num)
String getErrorMessage()
String getWarningMessage()
int getErrorNumber()
int getErrorCode()
PCRE2_SIZE getErrorOffset()
Regex& compile(const String& re,const String& mod)
Regex& compile(const String& re="")
Regex& pattern(const String& re)
Regex& modifiers(const String& x)
Regex& locale(const String& x)
Regex& jpcre2Options(uint32_t x)
Regex& pcre2Options(uint32_t x)
void execute() //executes the compile operation.
RegexMatch& match()
RegexReplace& replace()
//Class RegexMatch
RegexMatch& numberedSubstringVector(VecNum& vec_num)
RegexMatch& namedSubstringVector(VecNas& vec_nas)
RegexMatch& nameToNumberMapVector(VecNtN& vec_ntn)
RegexMatch& subject(const String& s)
RegexMatch& modifiers(const String& s)
RegexMatch& jpcre2Options(uint32_t x=NONE)
RegexMatch& pcre2Options(uint32_t x=NONE)
RegexMatch& findAll()
SIZE_T execute() //executes the match operation
//Class RegexReplace
RegexReplace& subject(const String& s)
RegexReplace& replaceWith(const String& s)
RegexReplace& modifiers(const String& s)
RegexReplace& jpcre2Options(uint32_t x=NONE)
RegexReplace& pcre2Options(uint32_t x=NONE)
RegexReplace& bufferSize(PCRE2_SIZE x)
String execute() //executes the replacement operation
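
To illustrate why most of the functions above return a reference to their own class, here is a minimal, self-contained sketch of the method-chaining idea. It is not JPCRE2 code: `MiniMatch` and `VecNumSketch` are hypothetical names, the matching is done with `std::regex` instead of PCRE2, and the vector-of-maps shape is an assumption based on the description of `vec_num` earlier in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Assumed shape of a "vector of numbered-substring maps";
// JPCRE2's real typedef may differ.
typedef std::vector<std::map<std::size_t, std::string> > VecNumSketch;

// Hypothetical miniature of a chained match builder, backed by std::regex.
class MiniMatch {
    std::regex re_;
    std::string subject_;
    VecNumSketch* vec_num_;
    bool find_all_;
public:
    explicit MiniMatch(const std::string& pattern)
        : re_(pattern), vec_num_(nullptr), find_all_(false) {}

    // Each setter stores its argument and returns *this to allow chaining.
    MiniMatch& subject(const std::string& s) { subject_ = s; return *this; }
    MiniMatch& numberedSubstringVector(VecNumSketch& v) { vec_num_ = &v; return *this; }
    MiniMatch& findAll() { find_all_ = true; return *this; }

    // Terminal call: runs the match and returns the number of matches found.
    std::size_t execute() {
        if (vec_num_) vec_num_->clear();
        std::size_t count = 0;
        std::sregex_iterator it(subject_.begin(), subject_.end(), re_), end;
        for (; it != end; ++it) {
            if (vec_num_) {
                std::map<std::size_t, std::string> groups;
                for (std::size_t g = 0; g < it->size(); ++g)
                    groups[g] = (*it)[g].str();   // key 0 is the total match
                vec_num_->push_back(groups);
            }
            ++count;
            if (!find_all_) break;                // stop after the first match
        }
        return count;
    }
};
```

A call such as `MiniMatch("(\\w+)=(\\d+)").subject("a=1, b=22").numberedSubstringVector(vec).findAll().execute()` reads in the same style as the chains shown in the usage section: every setter hands back the object, and only `execute()` does the work.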
###Modifiers:
JPCRE2 uses modifiers to control various options, the type and behavior of the regex, and its interactions with the functions that use it. Two types of modifiers are available: compile modifiers and action modifiers:
- Compile modifiers: Modifiers that are used to compile a regex. They define the behavior of a regex pattern. The modifiers have more or less the same meaning as the PHP regex modifiers except for `e`, `j` and `n` (marked with *). The available compile modifiers are:
- e* : Unset back-references in the pattern will match to empty strings. Equivalent to PCRE2_MATCH_UNSET_BACKREF.
- i : Case-insensitive. Equivalent to PCRE2_CASELESS option.
- j* : `\u`, `\U`, `\x` and unset back-references will behave as in the JavaScript standard.
  - `\U` matches an upper case "U" character (by default it causes a compile time error if this option is not set).
  - `\u` matches a lower case "u" character unless it is followed by four hexadecimal digits, in which case the hexadecimal number defines the code point to match (by default it causes a compile time error if this option is not set).
  - `\x` matches a lower case "x" character unless it is followed by two hexadecimal digits, in which case the hexadecimal number defines the code point to match (by default, as in Perl, a hexadecimal number is always expected after `\x`, but it may have zero, one, or two digits, so, for example, `\xz` matches a binary zero character followed by z).
  - Unset back-references in the pattern will match to empty strings.
- m : Multi-line regex. Equivalent to PCRE2_MULTILINE option.
- n* : Enable Unicode support for `\w`, `\d` etc. in the pattern. Equivalent to PCRE2_UTF | PCRE2_UCP.
- s : If this modifier is set, a dot meta-character in the pattern matches all characters, including newlines. Equivalent to PCRE2_DOTALL option.
- u : Enable UTF support. Treat pattern and subjects as UTF strings. Equivalent to PCRE2_UTF option.
- x : Whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, enables commentary in pattern. Equivalent to PCRE2_EXTENDED option.
- A : Match only at the first position. It is equivalent to PCRE2_ANCHORED option.
- D : A dollar meta-character in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. Equivalent to PCRE2_DOLLAR_ENDONLY option.
- J : Allow duplicate names for subpatterns. Equivalent to PCRE2_DUPNAMES option.
- S : When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up matching/replacing. It may also be beneficial for a very long subject string or pattern. Equivalent to an extra JIT compilation with the PCRE2_JIT_COMPLETE option.
- U : This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by `?`. Equivalent to PCRE2_UNGREEDY option.
- Action modifiers: Modifiers that are used per action, i.e. match or replace. These modifiers are not compiled into the regex itself; they are applied on each call of the corresponding function. Available action modifiers are:
- A : Match at start. Equivalent to PCRE2_ANCHORED. Can be used in match operation. Setting this option only at match time (i.e regex was not compiled with this option) will disable optimization during match time.
- e : Replaces unset group with empty string. Equivalent to PCRE2_SUBSTITUTE_UNSET_EMPTY. Can be used in replace operation.
- E : Extension of e modifier. Sets even unknown groups to empty string. Equivalent to PCRE2_SUBSTITUTE_UNSET_EMPTY | PCRE2_SUBSTITUTE_UNKNOWN_UNSET.
- g : Global. Will perform global matching or replacement if passed.
- x : Extended replacement operation. It enables some Bash-like features: `${<n>:-<string>}` and `${<n>:+<string1>:<string2>}`. `<n>` may be a group number or a name. The first form specifies a default value: if group `<n>` is set, its value is inserted; if not, `<string>` is expanded and the result is inserted. The second form specifies strings that are expanded and inserted when group `<n>` is set or unset, respectively. The first form is just a convenient shorthand for `${<n>:+${<n>}:<string>}`.
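
The two Bash-like forms enabled by the `x` action modifier can be made concrete with a toy expander. This is only a sketch of the semantics described above, not the PCRE2 or JPCRE2 implementation: `expandBody()`, `expandAll()` and `Groups` are hypothetical names, nesting and escapes are not handled, the strings are inserted literally rather than expanded again, and a plain `${n}` for an unset group simply yields an empty string.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Holds only the groups that are set, keyed by number or name (as text).
typedef std::map<std::string, std::string> Groups;

// Expands the inside of one "${...}", e.g. "1", "1:-dflt" or "name:+yes:no".
std::string expandBody(const std::string& body, const Groups& set) {
    const std::size_t colon = body.find(':');
    const std::string name = body.substr(0, colon);
    const Groups::const_iterator it = set.find(name);
    const bool is_set = (it != set.end());
    if (colon == std::string::npos || colon + 1 >= body.size())
        return is_set ? it->second : std::string();      // plain ${n}
    const char op = body[colon + 1];
    const std::string rest = body.substr(colon + 2);
    if (op == '-')                                       // ${n:-string}
        return is_set ? it->second : rest;
    if (op == '+') {                                     // ${n:+string1:string2}
        const std::size_t sep = rest.find(':');
        const std::string when_set = rest.substr(0, sep);
        const std::string when_unset =
            (sep == std::string::npos) ? std::string() : rest.substr(sep + 1);
        return is_set ? when_set : when_unset;
    }
    return std::string();                                // unknown form: give up
}

// Replaces every non-nested "${...}" in tmpl using expandBody().
std::string expandAll(const std::string& tmpl, const Groups& set) {
    std::string out;
    std::size_t pos = 0;
    while (pos < tmpl.size()) {
        const std::size_t open = tmpl.find("${", pos);
        if (open == std::string::npos) { out += tmpl.substr(pos); break; }
        const std::size_t close = tmpl.find('}', open + 2);
        if (close == std::string::npos) { out += tmpl.substr(pos); break; }
        out += tmpl.substr(pos, open - pos);
        out += expandBody(tmpl.substr(open + 2, close - open - 2), set);
        pos = close + 1;
    }
    return out;
}
```

With only group `1` set to `cat`, the template `[${1:-none}]` gives `[cat]`, `[${2:-none}]` gives `[none]`, and `${1:+yes:no}/${2:+yes:no}` gives `yes/no`.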
###JPCRE2 options:
These options are meaningful only for the JPCRE2 library itself, not the original PCRE2 library. We use the `jpcre2Options()` function to pass these options.
- jpcre2::NONE: This is the default option. Equivalent to 0 (zero).
- jpcre2::VALIDATE_MODIFIER: If this option is passed, modifiers will be subject to a validation check. If any of them is invalid, a `jpcre2::ERROR::INVALID_MODIFIER` error exception will be thrown. You can get the error message with the `getErrorMessage(error_code)` member function.
- jpcre2::FIND_ALL: This option will do a global matching if passed during matching. The same can be achieved by passing the 'g' modifier with the `modifiers()` function.
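
The effect of a validation option on modifier parsing might look roughly like the sketch below. Everything in it is hypothetical: the constants are made-up bit values rather than the ones JPCRE2 or PCRE2 define, only three modifiers are handled, and the choice to silently skip unknown modifiers when validation is off is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Made-up values for this sketch only (not JPCRE2/PCRE2 constants).
const std::uint32_t OPT_NONE              = 0;
const std::uint32_t OPT_VALIDATE_MODIFIER = 1u << 0;
const std::uint32_t FLAG_CASELESS         = 1u << 0;
const std::uint32_t FLAG_MULTILINE        = 1u << 1;
const std::uint32_t FLAG_DOTALL           = 1u << 2;
const int ERR_INVALID_MODIFIER            = 2;

// Translates a modifier string into a bitmask of flags.
// With OPT_VALIDATE_MODIFIER set, an unknown character throws an int error
// code; without it, the unknown character is skipped (assumed behaviour).
std::uint32_t parseModifiers(const std::string& mods, std::uint32_t options) {
    std::uint32_t flags = 0;
    for (std::size_t i = 0; i < mods.size(); ++i) {
        switch (mods[i]) {
            case 'i': flags |= FLAG_CASELESS;  break;
            case 'm': flags |= FLAG_MULTILINE; break;
            case 's': flags |= FLAG_DOTALL;    break;
            default:
                if (options & OPT_VALIDATE_MODIFIER) throw ERR_INVALID_MODIFIER;
                break;
        }
    }
    return flags;
}
```

Calling `parseModifiers("i?", OPT_NONE)` returns just the caseless flag, while the same string with `OPT_VALIDATE_MODIFIER` throws, which the caller can handle in a `catch(int e)` block like the ones shown in the usage section.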
###PCRE2 options:
While having its own way of doing things, JPCRE2 also supports the traditional PCRE2 options to be passed. We use the `pcre2Options()` function to pass the PCRE2 options. These options are the same as the PCRE2 library and have the same meaning. For example, instead of passing the 'g' modifier to the replacement operation we can also pass its PCRE2 equivalent PCRE2_SUBSTITUTE_GLOBAL to have the same effect.
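
For readers unfamiliar with what a "global" replacement option changes, the following sketch shows the difference using only the C++ standard library. It is an analogy, not JPCRE2 or PCRE2 code: `replaceSketch()` is a hypothetical helper, and note that `std::regex_replace` is global by default, so the flag here selects the first-only behaviour instead.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: `global` plays the role of the 'g' modifier
// (or of an equivalent option bit) in a replace operation.
std::string replaceSketch(const std::string& subject, const std::string& pattern,
                          const std::string& replacement, bool global) {
    const std::regex re(pattern);
    const std::regex_constants::match_flag_type flags =
        global ? std::regex_constants::format_default      // replace every match
               : std::regex_constants::format_first_only;  // replace first match only
    return std::regex_replace(subject, re, replacement, flags);
}
```

For the subject `a1b22c333` and the pattern `\d+`, a global replacement with `#` produces `a#b#c#`, while a non-global one produces `a#b22c333`.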
#Testing:
- test_match.cpp: Contains example code for the match function.
- test_replace.cpp: Contains example code for the replace function.
- test_match2.cpp: Another matching example. The makefile creates a binary of this (jpcre2match).
- test_replace2.cpp: Another replacement example. The makefile creates a binary of this (jpcre2replace).
#Screenshots of some test outputs:
subject = "(I am a string with words and digits 45 and specials chars: ?.#@ 443 অ আ ক খ গ ঘ 56)"
pattern = "(?:(?<word>[?.#@:]+)|(?<word>\\w+))\\s*(?<digit>\\d+)"