Performance comparison of regular expression engines

the original test environment by dark100 at http://sljit.sourceforge.net/regex_perf.html

best view [here](https://nasciiboy.github.io/raptorVSworld/)

Introduction

Processing text or raw byte-sequence are among the common tasks performed by most software tools. These tasks usually involve pattern matching algorithms, and the most popular tool for such purpose is regular expressions. Regular expressions has been evolved a lot since Kleene defined the regular sets in the 1950’s. Today we have several widely used regular expression engines which have different features which makes any performance comparison a difficult task, since a faster engine is not necessary better. Depending on the use case it might be enough to know whether a POSIX compatible regular expression matches to a line, even the position of the match is unneeded (grep utility). On the contrary other use cases require the position of capturing brackets, unicode support, conditional and atomic block (handling a byte sequence as a single character, like ‘sch’ in German language) support just to name a few. The former case needs a less sophisticated algorithm, which is likely be much faster than the latter, but again, that does not mean the former is better. More about these engine types can be found here.

Participants

The following popular engines were choosen:

and

Before anyone jump to any conclusions, I should note the followings:

  • The engines were not fine tuned (because of my lack of knowledge about their internal workings). I just compiled them with the default options. I know enabling or disabling some features can heavily affect the results. If you feel that you have a better configuration just drop me an e-mail and I will update the results (mailto:nasciiboy@gmail.com).
  • The regular expression engines are compiled with -O3 to allow the best performance.
  • This comparison page was inspired by the work of John Maddock (See his own regex comparison here). The input is also the same he used before: mtent12.zip. It is a text file (e-book) which size is about 20 Mbytes.
  • Only common patterns are selected, they are not pathological cases nor have any PERL specific features. The comparison was caseful.

Results

x86-64 bit Intel Cerelon 847 1.1GHz (GCC 7.2.0, Go 1.9.1 – GNU/Linux)

Regular expressiontrere2Golangre2pcre-JITpcre-DFApcreonigregexp4Golangregexp4regexp3
".|\n"
"."
6486ms (19285324)34159ms (19285324)9072ms (19285324)1130ms (19285324)6677ms (19285324)4448ms (19285324)12876ms (19285324)1553ms (19285324)605ms (19285324)1808ms (19285324)
"\d"
":d"
1044ms (27084)3166ms (27084)134ms (27084)55ms (27084)68ms (27084)65ms (27084)127ms (27084)1340ms (27084)594ms (27084)1853ms (27084)
"\S"
":S"
4633ms (15451664)32530ms (15451664)7405ms (15451664)931ms (15451664)4593ms (15451664)3086ms (15451664)10453ms (15451664)2056ms (15451664)810ms (15451664)1912ms (15451664)
"\S+"
":S+"
2565ms (3414592)8000ms (3414592)1957ms (3414592)333ms (3414592)1619ms (3414592)924ms (3414592)3114ms (3414592)1521ms (3414592)654ms (3414592)1059ms (3414592)
"\w"
":w"
4769ms (14751878)32506ms (14751878)7052ms (14751878)952ms (14751878)4537ms (14751878)2994ms (14751878)10865ms (14751878)1811ms (14750958)819ms (14750958)1888ms (14750958)
"[:w_]"
"\w"
4762ms (14751878)29682ms (14751878)7066ms (14751878)952ms (14751878)4486ms (14751878)2989ms (14751878)10663ms (14751878)2472ms (14751878)1160ms (14751878)3171ms (14751878)
"[a-zA-Z0-9_]"
"[a-zA-Z0-9_]"
4650ms (14751878)29437ms (14751878)7121ms (14751878)994ms (14751878)4538ms (14751878)3074ms (14751878)10483ms (14751878)2735ms (14751878)1104ms (14751878)5260ms (14751878)
"[a-zA-Z]+"
"[a-zA-Z]+"
2317ms (3495761)7582ms (3495761)2048ms (3495761)327ms (3495761)1551ms (3495761)1007ms (3495761)2956ms (3495761)2250ms (3495761)860ms (3495761)2771ms (3495761)
"[.\s]+"
"[.:s]+"
2205ms (991813)9541ms (3430783)2036ms (3430783)399ms (3430783)1037ms (3430783)949ms (3430783)2676ms (3430783)2851ms (3430783)1363ms (3430783)3741ms (3430783)
"([^\n]+)"
"<[^\n]+>"
1591ms (314387)5611ms (314387)430ms (314387)82ms (314387)1038ms (314387)210ms (314387)697ms (314387)1301ms (314387)578ms (314387)788ms (314387)
"e"
"e"
520ms (1781425)2842ms (1781425)1000ms (1781425)139ms (1781425)439ms (1781425)380ms (1781425)1434ms (1781425)1739ms (1781425)663ms (1781425)1809ms (1781425)
"(((((e)))))"
"<<<<<e>>>>>"
529ms (1781425)4399ms (1781425)1001ms (1781425)203ms (1781425)1269ms (1781425)1256ms (1781425)1971ms (1781425)6168ms (1781425)2360ms (1781425)19939ms (1781425)
"((((((((((e))))))))))"
"<<<<<<<<<<e>>>>>>>>>>"
525ms (1781425)6709ms (1781425)1006ms (1781425)298ms (1781425)1850ms (1781425)1945ms (1781425)2176ms (1781425)12171ms (1781425)4034ms (1781425)58376ms (1781425)
"Twain"
"Twain"
1022ms (2388)12ms (2388)8ms (2388)49ms (2388)47ms (2388)10ms (2388)52ms (2388)1610ms (2388)589ms (2388)2450ms (2388)
"(Twain)"
"<Twain>"
1015ms (2388)13ms (2388)8ms (2388)49ms (2388)48ms (2388)14ms (2388)53ms (2388)2294ms (2388)927ms (2388)6027ms (2388)
"(?i)Twain"
"#*Twain"
1337ms (2657)3515ms (2657)166ms (2657)51ms (2657)287ms (2657)203ms (2657)349ms (2657)1616ms (2657)727ms (2657)2575ms (2657)
"(?i:Tw)(?:(?:a?A?[i_I]{0,2})?[nN])"
"(Tw)#*((a?A?[i_I]{0,2})?[nN])"
1375ms (2989)3474ms (2989)168ms (2989)56ms (2989)319ms (2989)236ms (2989)353ms (2989)2154ms (2989)1021ms (2989)9216ms (2989)
"[a-z]shing"
"[a-z]shing"
1597ms (1877)3614ms (1877)277ms (1877)47ms (1877)2302ms (1877)1487ms (1877)49ms (1877)3250ms (1877)1267ms (1877)5413ms (1877)
"Huck[a-zA-Z]+|Saw[a-zA-Z]+"
"Huck[a-zA-Z]+|Saw[a-zA-Z]+"
1779ms (396)4431ms (396)126ms (396)8ms (396)72ms (396)70ms (396)117ms (396)3894ms (396)1456ms (396)6599ms (396)
"[a-q][^u-z]{13}x"
"[a-q][^u-z]{13}x"
4446ms (5021)11181ms (5021)613ms (5021)5ms (5021)6169ms (5021)1742ms (5021)135ms (5021)9796ms (5021)3347ms (5021)10391ms (5021)
"Tom|Sawyer|Huckleberry|Finn"
"Tom|Sawyer|Huckleberry|Finn"
3151ms (3015)9003ms (3015)134ms (3015)89ms (3015)96ms (3015)97ms (3015)141ms (3015)6608ms (3015)2608ms (3015)10659ms (3015)
"(Tom|Sawyer|Huckleberry|Finn)"
"<Tom|Sawyer|Huckleberry|Finn>"
3302ms (3015)8876ms (3015)131ms (3015)87ms (3015)100ms (3015)102ms (3015)142ms (3015)7240ms (3015)2964ms (3015)23149ms (3015)
"[hHeELlOo][hHeELlOo][hHeELlOo][hHeELlOo][hHeELlOo]"
"[hHeELlOo][hHeELlOo][hHeELlOo][hHeELlOo][hHeELlOo]"
3656ms (534)4503ms (534)250ms (534)247ms (534)889ms (534)638ms (534)687ms (534)2903ms (534)1589ms (534)10916ms (534)
"Tom.{10,25}river|river.{10,25}Tom"
1844ms (2)4541ms (2)152ms (2)44ms (2)247ms (2)211ms (2)236ms (2)N/AN/AN/A
"ing[^a-zA-Z]"
"ing[^a-zA-Z]"
1149ms (85956)228ms (85956)111ms (85956)54ms (85956)243ms (85956)140ms (85956)140ms (85956)1611ms (85956)627ms (85956)3131ms (85956)
"[a-zA-Z]ing[^a-zA-Z]"
"[a-zA-Z]ing[^a-zA-Z]"
1859ms (85823)3884ms (85823)309ms (85823)56ms (85823)2376ms (85823)1549ms (85823)143ms (85823)3274ms (85823)1328ms (85823)6761ms (85823)
"([a-zA-Z]+ing)"
2070ms (95863)5411ms (95863)318ms (95863)213ms (95863)6616ms (95863)4174ms (95863)2603ms (95863)N/AN/AN/A

x86-64 bit AMD A6-9500 3.8GHz (GCC 7.2.1, Go 1.9.1 – GNU/Linux)

Regular expressiontrere2Golangre2pcre-JITpcre-DFApcreonigregexp4Golangregexp4regexp3
".|\n"
"."
2039ms (19285324)12477ms (19285324)3361ms (19285324)594ms (19285324)1673ms (19285324)1379ms (19285324)4188ms (19285324)559ms (19285324)190ms (19285324)660ms (19285324)
"\d"
":d"
347ms (27084)1064ms (27084)46ms (27084)16ms (27084)19ms (27084)20ms (27084)43ms (27084)478ms (27084)253ms (27084)669ms (27084)
"\S"
":S"
1468ms (15451664)9661ms (15451664)2779ms (15451664)339ms (15451664)1040ms (15451664)964ms (15451664)3649ms (15451664)721ms (15451664)287ms (15451664)709ms (15451664)
"\S+"
":S+"
908ms (3414592)2793ms (3414592)728ms (3414592)113ms (3414592)501ms (3414592)303ms (3414592)943ms (3414592)531ms (3414592)200ms (3414592)368ms (3414592)
"\w"
":w"
1483ms (14751878)9036ms (14751878)2637ms (14751878)345ms (14751878)1001ms (14751878)927ms (14751878)3387ms (14751878)637ms (14750958)305ms (14750958)693ms (14750958)
"\w"
"[:w_]"
1479ms (14751878)9049ms (14751878)2627ms (14751878)345ms (14751878)1003ms (14751878)920ms (14751878)3364ms (14751878)849ms (14751878)433ms (14751878)1190ms (14751878)
"[a-zA-Z0-9_]"
"[a-zA-Z0-9_]"
1476ms (14751878)9898ms (14751878)2636ms (14751878)346ms (14751878)1016ms (14751878)966ms (14751878)3384ms (14751878)920ms (14751878)390ms (14751878)1953ms (14751878)
"[a-zA-Z]+"
"[a-zA-Z]+"
782ms (3495761)3242ms (3495761)744ms (3495761)116ms (3495761)491ms (3495761)292ms (3495761)977ms (3495761)750ms (3495761)301ms (3495761)927ms (3495761)
"[.\s]+"
"[.:s]+"
658ms (991813)3080ms (3430783)684ms (3430783)117ms (3430783)329ms (3430783)276ms (3430783)865ms (3430783)955ms (3430783)484ms (3430783)1338ms (3430783)
"([^\n]+)"
"<[^\n]+>"
562ms (314387)1909ms (314387)149ms (314387)32ms (314387)342ms (314387)64ms (314387)234ms (314387)433ms (314387)174ms (314387)246ms (314387)
"e"
"e"
166ms (1781425)952ms (1781425)339ms (1781425)54ms (1781425)149ms (1781425)120ms (1781425)440ms (1781425)581ms (1781425)231ms (1781425)679ms (1781425)
"(((((e)))))"
"<<<<<e>>>>>"
166ms (1781425)1472ms (1781425)347ms (1781425)94ms (1781425)433ms (1781425)381ms (1781425)540ms (1781425)2126ms (1781425)824ms (1781425)7649ms (1781425)
"((((((((((e))))))))))"
"<<<<<<<<<<e>>>>>>>>>>"
166ms (1781425)2267ms (1781425)342ms (1781425)136ms (1781425)841ms (1781425)617ms (1781425)678ms (1781425)3981ms (1781425)1465ms (1781425)24102ms (1781425)
"Twain"
"Twain"
334ms (2388)5ms (2388)3ms (2388)12ms (2388)16ms (2388)4ms (2388)13ms (2388)559ms (2388)211ms (2388)1030ms (2388)
"(Twain)"
"<Twain>"
333ms (2388)5ms (2388)3ms (2388)12ms (2388)16ms (2388)5ms (2388)14ms (2388)781ms (2388)336ms (2388)2506ms (2388)
"#*Twain"
"(?i)Twain"
445ms (2657)1204ms (2657)65ms (2657)13ms (2657)96ms (2657)71ms (2657)117ms (2657)543ms (2657)246ms (2657)1061ms (2657)
"(?i:Tw)(?:(?:a?A?[i_I]{0,2})?[nN])"
"(Tw)#*((a?A?[i_I]{0,2})?[nN])"
458ms (2989)1191ms (2989)68ms (2989)25ms (2989)102ms (2989)80ms (2989)113ms (2989)729ms (2989)334ms (2989)3617ms (2989)
"[a-z]shing"
"[a-z]shing"
524ms (1877)1320ms (1877)90ms (1877)11ms (1877)750ms (1877)498ms (1877)12ms (1877)1073ms (1877)436ms (1877)1837ms (1877)
"Huck[a-zA-Z]+|Saw[a-zA-Z]+"
"Huck[a-zA-Z]+|Saw[a-zA-Z]+"
548ms (396)1511ms (396)42ms (396)4ms (396)23ms (396)23ms (396)39ms (396)1271ms (396)494ms (396)2483ms (396)
"[a-q][^u-z]{13}x"
"[a-q][^u-z]{13}x"
1467ms (5021)3540ms (5021)359ms (5021)3ms (5021)1985ms (5021)588ms (5021)40ms (5021)3306ms (5021)1190ms (5021)3356ms (5021)
"Tom|Sawyer|Huckleberry|Finn"
"Tom|Sawyer|Huckleberry|Finn"
931ms (3015)2794ms (3015)44ms (3015)24ms (3015)30ms (3015)31ms (3015)47ms (3015)2295ms (3015)864ms (3015)4132ms (3015)
"(Tom|Sawyer|Huckleberry|Finn)"
"<Tom|Sawyer|Huckleberry|Finn>"
937ms (3015)2956ms (3015)44ms (3015)26ms (3015)31ms (3015)33ms (3015)47ms (3015)2511ms (3015)1011ms (3015)9374ms (3015)
"[hHeELlOo][hHeELlOo][hHeELlOo][hHeELlOo][hHeELlOo]"
"[hHeELlOo][hHeELlOo][hHeELlOo][hHeELlOo][hHeELlOo]"
1104ms (534)1526ms (534)90ms (534)81ms (534)288ms (534)203ms (534)227ms (534)954ms (534)594ms (534)3782ms (534)
"Tom.{10,25}river|river.{10,25}Tom"
604ms (2)1520ms (2)57ms (2)20ms (2)76ms (2)71ms (2)76ms (2)N/AN/AN/A
"ing[^a-zA-Z]"
"ing[^a-zA-Z]"
379ms (85956)76ms (85956)41ms (85956)23ms (85956)79ms (85956)48ms (85956)39ms (85956)592ms (85956)222ms (85956)1255ms (85956)
"[a-zA-Z]ing[^a-zA-Z]"
"[a-zA-Z]ing[^a-zA-Z]"
639ms (85823)1341ms (85823)103ms (85823)24ms (85823)767ms (85823)485ms (85823)39ms (85823)1143ms (85823)461ms (85823)2355ms (85823)
"([a-zA-Z]+ing)"
717ms (95863)1903ms (95863)106ms (95863)75ms (95863)1766ms (95863)1424ms (95863)925ms (95863)N/AN/AN/A

Compile

Deps

  • bash
  • make
  • gcc
  • g++
  • go

get data

  1. $ wget http://www.gutenberg.org/files/3200/old/mtent12.zip
  2. $ dos2unix mtent12.txt data.txt
  3. $ dos2unix data.txt
  4. $ rm mtent12*

build benchmarks

  1. $ ./rebuild.sh
  2. $ ./rebench.sh

clean

$ ./reclean.sh