regexp: POSIX regexp takes 4 seconds to execute

Question

regexp: POSIX regexp takes 4 seconds to execute

dvyukov opened this issue 9 years ago · 17 comments

The following FindAllString takes 4 seconds to execute. Not sure whether there is a performance problem or not, but the package claims to have linear execution time, so that's 5.5 ms per character which looks like an issue.

package main

import "regexp"

func main() {
    re := regexp.MustCompilePOSIX("|{0x6B27143FFb2C6.0xAfbb212adA07dC16E19-442426454885445967343337981592e-0xFDbfC4aD3F17CFC4Bd80x0C9aedB8cFCAb57bBD9AcDdfE10858856}^{-0363642521013670e-795506009086632}b|^050226227030136e1509816632298701486455993285303619190b|([^!]+{953})|^J4__P17Y#^^{0035.-0610260367-128}^f__R_f6_s_v6_8XLItr7_1_B__gxdp6ko81_tXI_4m94u9_q__78PA7_1___8tct__kY8_94|b5165Jp2_3N_xg3_|$\x02Q9m5C023w_Ej0_guW2%8_90a1\x0fAzPq4d_J__Kre4nkw3G\ftwentyeighttwentynineirtyone|99hSe9M7q1M_jbxbwJY4YFxDZ___Q_t2v___w_NeXertyeight|thirtynine|[AZ-]tyfive|D_||ortyeight|_ki_xbuJvO__102___2_KSS8Rkp2C__|fy||6G5a_15Zgx58__56E0__9bBfkKB_37_b9N__IFZ_n3EMk2__p|T__OFK__O2E3>|ht|$|||isie||efoursevsevnt||o|_6U|eix|_4aN964__A34lcQq|eightyeightthirtyeight|thirtynine_L")
    re.FindAllString("S__c2_Oe_5TK2cNDf2aC_F4y_Vwi____ggH_jR_95b_R8vFs8NEBs_5Ikmf5S8Laexw40C[five|X2R8r5__A9_u__c__O__150G_az_MbK__04XGMmF_R_h_FwYp_g5_2___bic|seven|PS1|nine|71U|b1_U0a_t_4VKEqIRq7_2_g_giC1Hj_2__q9__La__7vmz_GnPT_Y5_iw1Wu__k33V_WW_gi__6__D_Vw_2qb_32|__80Tj6_e9F_PL__c1_03_xm2TK_1820bcWKL4B_8_T6|fourteen|fifteen|[^]\x1bseventeen\x13<nineteen|\t|\x05|twentytwo_|\x12twentyfive|\x16\x00]twentyeight|twentynine|FVSv_pp3cE|thirtyone|\x03^|_|thirtyfive|thirtysix|:^xdigitthree|Wc___7_81DL__2JZ4S1u_lY6U9sN_1tw8bI6__k__f1Tn_gsN_EeQ4H2KR_9r_V_N5_t4XuIpdzt_WD3Uvq_X9_rWw_A__X|five|six\x0f83|G_h|nine|ten|eleventwelve|thirteen|-0xaaaaDe20daBF3baB|_miAsixteen|seventeen|eighteen|nineteenZa:3\"}('(|?b\x00\x00ascii[AZ-]D)e_4bVe___53_08__QX[:\x1b__fXM_C0_bU:]|^{32}\x16'{$}", -1)
}

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64

Answer 1 · 2015-06-12T10:11:23.000Z

interesting. playground seems to give up.

http://play.golang.org/p/YIrDvZx0Fk

Answer 2 · 2015-06-13T20:09:58.000Z

I put a simplified test case at https://play.golang.org/p/Pkm1dxj2Pq

The problem is the "[^!]+{999}" You can make the regular expression take an arbitrarily long time by just tacking more copies of that string onto the end of the regexp in the simplified example I gave.

The nested repetition currently only causes an regexp parse error when using the PerlX syntax flag.

https://go-review.googlesource.com/11080/

Answer 3 · 2015-06-13T20:23:13.000Z

Thanks, @tolchz

Here is arguably even simpler version, runs for 8 seconds

package main

import (
    "regexp"
    "strings"
)

func main() {
    re := regexp.MustCompilePOSIX("|A+{1000}")
    re.FindAllString(strings.Repeat("A", 999), -1)
}

Answer 4 · 2015-06-13T21:00:30.000Z

CL https://golang.org/cl/11080 mentions this issue.

Answer 5 · 2015-06-15T11:41:35.000Z

More fun can be had too:

package main
import (
    "regexp"
    "strings"
)

func main() {
    re := regexp.MustCompilePOSIX("|A+{1000}|[^B]+{1000}|[^C]+{1000}|[^D]+{1000}")
    re.FindAllString(strings.Repeat("A", 999), -1)
}

real 0m32.585s
user 0m32.637s
sys 0m0.008s

Answer 6 · 2015-07-24T07:45:58.000Z

google/re2@7444e38 is probably RTYI. I could work on porting this to Go if you like. :)

(Ping, @kcc and @rsc.)

Answer 7 · 2015-10-14T16:35:07.000Z

This is working as intended. The repetition is just that, a repetition. If you make that number bigger, the time takes longer. It's linear in the size of the regexp where the size of a{1000} is 1000. It's also linear in the size of the input: O(nm) where n is text size and m is regexp size. This is more or less fundamental.

Answer 8 · 2015-10-14T16:41:49.000Z

It is not linear.
"|A+{500}" takes 0.2s
"|A+{1000}" takes 8s

Answer 9 · 2016-05-23T10:35:05.000Z

It makes sense that runtime is linear in n, but only for a fixed regexp size m.

We must also interpret it as "worst case upper bound is linear", and this isn't a promise that average case will grow linearly.

I ran a simulation with pattern "|A+{m}", with various values of n from 1 to 1000, and m from 1 to 1000, here the result (high durations in red) :

This shows that the worst case is obtained with n = m - 1

We can't write pattern "|A+{m}" with m > 1000, however it must be equivalent (iiuc) to expanded pattern "|(A+A+A+ ... A+)" with m repetitions of A+ .

So I ran a second simulation with n = m - 1 , with various values of m from 1 to 2000 :

re = regexp.MustCompilePOSIX("|(" + strings.Repeat("A+", m) + ")") re.FindAllString(strings.Repeat("A", m-1), -1)

and plotted the runtime on Y axis :

According to the "linear in m and linear in n" promise, we should have worst time proportional to m^2 .

Unfortunatly, the results (for this particular pattern and input string) clearly indicate that worst time is proportional to m^3 .

Answer 10 · 2016-05-23T13:46:34.000Z

@Deleplace very interesting, thanks. I read a paper recently (can't share it yet sorry) that might help us implement counted repetition more efficienctly, but even without that I am a bit surprised by the m^3. I'd like to understand that better, but I'm about to go on leave for the summer. I'll flag this for Go 1.8 and hopefully someone will be able to investigate it then.

/cc @junyer FYI

Answer 11 · 2016-05-23T15:30:08.000Z

Oddly enough, the | seems to be causing the problems here. If you remove it, POSIX mode matches just as quickly as Perl mode.

Answer 12 · 2016-05-23T16:41:35.000Z

... and if you place the | at the end instead, Perl mode matches just as slowly as POSIX mode. :)

Answer 13 · 2016-05-23T16:42:59.000Z

Per @junyer, the problem here is the use of FindAll. The linear in m and linear in n (=m) applies to a single match attempt. In this case the match attempt scans the entire string in order to decide that there is no match except the empty string. And then FindAll steps forward one byte and tries again. So you have m^2 for each match attempt, as promised, and then m match attempts as well. Mystery solved. We may still be able to do a little better, but that particular cubic (really quadratic) is unavoidable.

Answer 14 · 2016-05-29T15:29:34.000Z

Per @junyer, the problem here is the use of FindAll. The linear in m and linear in n (=m) applies to a single match attempt. In this case the match attempt scans the entire string in order to decide that there is no match except the empty string. And then FindAll steps forward one byte and tries again.

Just based on your description, if there is no match in the whole string, how can there be a match in a substring?

Answer 15 · 2016-05-29T15:57:27.000Z

Empty string is a valid match, and it is present many times in the result slice :
https://play.golang.org/p/mJN6sJWJ68

Answer 16 · 2016-05-29T16:08:01.000Z

@Deleplace I see, thanks.

Answer 17 · 2016-10-18T15:04:06.000Z

We know why this happens and it seems unavoidable.