golang/go

regexp: POSIX regexp takes 4 seconds to execute

dvyukov opened this issue · 17 comments

The following FindAllString takes 4 seconds to execute. Not sure whether there is a performance problem or not, but the package claims to have linear execution time, so that's 5.5 ms per character which looks like an issue.

package main

import "regexp"

func main() {
    re := regexp.MustCompilePOSIX("|{0x6B27143FFb2C6.0xAfbb212adA07dC16E19-442426454885445967343337981592e-0xFDbfC4aD3F17CFC4Bd80x0C9aedB8cFCAb57bBD9AcDdfE10858856}^{-0363642521013670e-795506009086632}b|^050226227030136e1509816632298701486455993285303619190b|([^!]+{953})|^J4__P17Y#^^{0035.-0610260367-128}^f__R_f6_s_v6_8XLItr7_1_B__gxdp6ko81_tXI_4m94u9_q__78PA7_1___8tct__kY8_94|b5165Jp2_3N_xg3_|$\x02Q9m5C023w_Ej0_guW2%8_90a1\x0fAzPq4d_J__Kre4nkw3G\ftwentyeighttwentynineirtyone|99hSe9M7q1M_jbxbwJY4YFxDZ___Q_t2v___w_NeXertyeight|thirtynine|[AZ-]tyfive|D_||ortyeight|_ki_xbuJvO__102___2_KSS8Rkp2C__|fy||6G5a_15Zgx58__56E0__9bBfkKB_37_b9N__IFZ_n3EMk2__p|T__OFK__O2E3>|ht|$|||isie||efoursevsevnt||o|_6U|eix|_4aN964__A34lcQq|eightyeightthirtyeight|thirtynine_L")
    re.FindAllString("S__c2_Oe_5TK2cNDf2aC_F4y_Vwi____ggH_jR_95b_R8vFs8NEBs_5Ikmf5S8Laexw40C[five|X2R8r5__A9_u__c__O__150G_az_MbK__04XGMmF_R_h_FwYp_g5_2___bic|seven|PS1|nine|71U|b1_U0a_t_4VKEqIRq7_2_g_giC1Hj_2__q9__La__7vmz_GnPT_Y5_iw1Wu__k33V_WW_gi__6__D_Vw_2qb_32|__80Tj6_e9F_PL__c1_03_xm2TK_1820bcWKL4B_8_T6|fourteen|fifteen|[^]\x1bseventeen\x13<nineteen|\t|\x05|twentytwo_|\x12twentyfive|\x16\x00]twentyeight|twentynine|FVSv_pp3cE|thirtyone|\x03^|_|thirtyfive|thirtysix|:^xdigitthree|Wc___7_81DL__2JZ4S1u_lY6U9sN_1tw8bI6__k__f1Tn_gsN_EeQ4H2KR_9r_V_N5_t4XuIpdzt_WD3Uvq_X9_rWw_A__X|five|six\x0f83|G_h|nine|ten|eleventwelve|thirteen|-0xaaaaDe20daBF3baB|_miAsixteen|seventeen|eighteen|nineteenZa:3\"}('(|?b\x00\x00ascii[AZ-]D)e_4bVe___53_08__QX[:\x1b__fXM_C0_bU:]|^{32}\x16'{$}", -1)
}

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64

mattn commented

interesting. playground seems to give up.

http://play.golang.org/p/YIrDvZx0Fk

I put a simplified test case at https://play.golang.org/p/Pkm1dxj2Pq

The problem is the "[^!]+{999}" You can make the regular expression take an arbitrarily long time by just tacking more copies of that string onto the end of the regexp in the simplified example I gave.

The nested repetition currently only causes an regexp parse error when using the PerlX syntax flag.

https://go-review.googlesource.com/11080/

Thanks, @tolchz

Here is arguably even simpler version, runs for 8 seconds

package main

import (
    "regexp"
    "strings"
)

func main() {
    re := regexp.MustCompilePOSIX("|A+{1000}")
    re.FindAllString(strings.Repeat("A", 999), -1)
}

CL https://golang.org/cl/11080 mentions this issue.

More fun can be had too:

package main
import (
    "regexp"
    "strings"
)

func main() {
    re := regexp.MustCompilePOSIX("|A+{1000}|[^B]+{1000}|[^C]+{1000}|[^D]+{1000}")
    re.FindAllString(strings.Repeat("A", 999), -1)
}

real 0m32.585s
user 0m32.637s
sys 0m0.008s

google/re2@7444e38 is probably RTYI. I could work on porting this to Go if you like. :)

(Ping, @kcc and @rsc.)

rsc commented

This is working as intended. The repetition is just that, a repetition. If you make that number bigger, the time takes longer. It's linear in the size of the regexp where the size of a{1000} is 1000. It's also linear in the size of the input: O(nm) where n is text size and m is regexp size. This is more or less fundamental.

It is not linear.
"|A+{500}" takes 0.2s
"|A+{1000}" takes 8s

It makes sense that runtime is linear in n, but only for a fixed regexp size m.

We must also interpret it as "worst case upper bound is linear", and this isn't a promise that average case will grow linearly.

I ran a simulation with pattern "|A+{m}", with various values of n from 1 to 1000, and m from 1 to 1000, here the result (high durations in red) :

regexp_not_linear_plot

This shows that the worst case is obtained with n = m - 1

We can't write pattern "|A+{m}" with m > 1000, however it must be equivalent (iiuc) to expanded pattern "|(A+A+A+ ... A+)" with m repetitions of A+ .

So I ran a second simulation with n = m - 1 , with various values of m from 1 to 2000 :

re = regexp.MustCompilePOSIX("|(" + strings.Repeat("A+", m) + ")") re.FindAllString(strings.Repeat("A", m-1), -1)

and plotted the runtime on Y axis :

regex_time_curves

According to the "linear in m and linear in n" promise, we should have worst time proportional to m^2 .

Unfortunatly, the results (for this particular pattern and input string) clearly indicate that worst time is proportional to m^3 .

rsc commented

@Deleplace very interesting, thanks. I read a paper recently (can't share it yet sorry) that might help us implement counted repetition more efficienctly, but even without that I am a bit surprised by the m^3. I'd like to understand that better, but I'm about to go on leave for the summer. I'll flag this for Go 1.8 and hopefully someone will be able to investigate it then.

/cc @junyer FYI

Oddly enough, the | seems to be causing the problems here. If you remove it, POSIX mode matches just as quickly as Perl mode.

... and if you place the | at the end instead, Perl mode matches just as slowly as POSIX mode. :)

rsc commented

Per @junyer, the problem here is the use of FindAll. The linear in m and linear in n (=m) applies to a single match attempt. In this case the match attempt scans the entire string in order to decide that there is no match except the empty string. And then FindAll steps forward one byte and tries again. So you have m^2 for each match attempt, as promised, and then m match attempts as well. Mystery solved. We may still be able to do a little better, but that particular cubic (really quadratic) is unavoidable.

Per @junyer, the problem here is the use of FindAll. The linear in m and linear in n (=m) applies to a single match attempt. In this case the match attempt scans the entire string in order to decide that there is no match except the empty string. And then FindAll steps forward one byte and tries again.

Just based on your description, if there is no match in the whole string, how can there be a match in a substring?

Empty string is a valid match, and it is present many times in the result slice :
https://play.golang.org/p/mJN6sJWJ68

@Deleplace I see, thanks.

rsc commented

We know why this happens and it seems unavoidable.