psf/black

Freeze/bad performance on escaped quotation in f-string


Describe the bug
Black hangs (or takes an extremely long time) when processing a file with a multiline f-string containing many escaped quotation marks.

To Reproduce
Run black against this file:

def testing():
    somevar = "some value"
    return f"""
        {somevar} the following line causes the issue:
         \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" 
    """

Black just hangs without producing any output.

Expected behavior
Should run to completion within at most 1 second.

Environment

  • Black's version: 24.10.0 (was working with 24.4.2, probably introduced in 24.8.0)
  • OS and Python version: macOS/Python 3.13.0

Additional context
Extracting the quotes into a variable seems to circumvent the issue:

def testing():
    somevar = "some value"
    quotes = """\" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \" \""""
    return f"""
        {somevar} the following line causes the issue:
         {quotes} 
    """

The hang happens inside tokenization, at this line: endmatch = endprog.match(line)

At that point endprog is this regex, which exhibits catastrophic backtracking: ((?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N){(?!{)|(?:\\.|"(?!"")|[^"\\])*""")

Why I think it catastrophically backtracks

Since the regex is PCRE2-compatible, I can use regex101's debugger. The issue appears to be that the first repeated group, (?:\\N{|{{|\\"|"(?!"")|[^"{]), contains both "(?!"") and [^"{] as alternatives, so it is guaranteed to match any sequence of characters as long as it doesn't contain a {. This first repetition therefore always matches up to the end of the string. After that, since the string doesn't contain a {, the required { fails to match and the entire regex backtracks. This is already bad, with simple long strings taking hundreds of steps, but it gets much worse when \"s are added: every \ and " is a valid starting point for the first repeat, so each one makes the entire regex run again over a shorter remainder, for (I think) O(n log n) performance.
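To make the blow-up easy to reproduce outside of Black, here is a minimal timing sketch; the regex is the one quoted above, split into its two alternatives, while the harness around it is purely illustrative:

import re
import time

# The regex quoted above, split into its two alternatives (the tokenizer's
# Double3Lbrace and Double3 patterns, discussed further below).
Double3Lbrace = r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N){(?!{)'
Double3 = r'(?:\\.|"(?!"")|[^"\\])*"""'
endprog = re.compile(f"({Double3Lbrace}|{Double3})")

# A line like the one in the repro: escaped quotes, no { and no closing """,
# so the match must fail after exploring every way of carving up the \" pairs.
for n in (14, 16, 18, 20, 22):
    line = '\\" ' * n + "\n"
    start = time.perf_counter()
    assert endprog.match(line) is None
    print(f"{n:2} escaped quotes: {time.perf_counter() - start:.3f}s")

Each \" can be consumed either by \\" or by [^"{] followed by "(?!""), which is what makes the number of paths the engine has to explore explode as more escaped quotes are added.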

That regex got onto endprog_stack at this line, endprog_stack.append(endprog), via endprog = endprogs[token].

At that point token is f""", as expected.

endprogs[f"""] is generated at this line **{f'{prefix}"""': double3prog_plus_lbrace for prefix in _fstring_prefixes},

double3prog_plus_lbrace is generated by double3prog_plus_lbrace = re.compile(group(Double3Lbrace, Double3))

Double3Lbrace is the offending portion; it is defined here: Double3Lbrace = r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N){(?!{)'

Notably, this is commented with # beginning of a triple quoted f-string. must not end with {{ or \N{, so perhaps it is being misused to generate part of endprogs? I'm not sure of the original intent here.

It looks like single-quoted f-strings don't add their version of the regex to endprog_stack, so they don't hang. I'm not sure why there is that difference between single- and triple-quoted strings.

Thanks for the analysis @MeGaGiGaGon! Also cc @tusharsadhwani.

Thanks for the analysis! I'm guessing it is a backtracking bug as well. I'm hoping to reproduce the slowness in regex101, and look for a workaround that doesn't break anything.

Reproduced: https://regex101.com/r/tmjlX1/1

Adding any more \" in there causes regex101 to bail out. Every new \" multiplies the backtrack count by 2.

https://regex101.com/r/H8yqIO/1/debugger

And this debugger view shows what's going on: for every possible combination of the \"s present, it tries to match the end sequence and fails, except at the very last one.

I don't know why regex101 is weird about it, but if you put Double3Lbrace in a non-capturing group and add an |a after it, it will show you the number of steps needed to fail, plus 1. This works in both PCRE2 and Python modes. https://regex101.com/r/Nt4Wuh/1

@JelleZijlstra while I'm debugging this, I do believe the tokenizer might be in need of a rewrite that doesn't use any regex, either in Python itself (if you think that won't be too slow), or in another language (which could also bring a nice performance boost).

I have already written a tokenizer in Zig that is much faster, passes all of Black's tests, and can be turned into a Python package. Would you consider adopting it into Black?

Or if not, would porting the exact same logic (it iterates over each character with almost no backtracking) into Python be too slow?
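To illustrate what such a port could look like, here is a rough sketch of the character-by-character idea; this is not the actual Zig/pytokens code, just an assumed simplification for scanning the body of a triple-double-quoted f-string:

def scan_fstring_middle(src: str, pos: int) -> tuple[int, str]:
    """Scan a triple-double-quoted f-string body starting at pos.

    Returns (new_position, reason): "lbrace" when stopping at a single {
    (the start of a replacement field), "end" at the closing quotes, or
    "eof" if the input runs out. The real tokenizer tracks more state
    (string prefixes, single quotes, line numbers, ...); this is a sketch.
    """
    i, n = pos, len(src)
    while i < n:
        c = src[i]
        if c == "\\":
            # \N{ is a named escape, so its { does not start a replacement field.
            i += 3 if src.startswith("\\N{", i) else 2
            continue
        if c == "{":
            if src.startswith("{{", i):
                i += 2  # literal {{, keep scanning
                continue
            return i, "lbrace"
        if c == '"' and src.startswith('"""', i):
            return i + 3, "end"
        i += 1
    return i, "eof"

# The pathological input from this issue is handled in one linear pass:
body = '\\" ' * 52 + '\n"""'
print(scan_fstring_middle(body, 0))  # (position just past the closing quotes, 'end')

Whether a straightforward port like this is fast enough in CPython is exactly the open question.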

Thanks for looking into this! If it works well, I'd consider switching to your tokenizer, but dependencies that need compilation can make it harder for users to use Black. In the past, that was the motivation for us to switch from regex to re. I'm not familiar with Zig at all; I assume you'd provide wheels for common platforms, but for people not covered by those wheels, it might be even harder to use the package as they'd need a Zig compiler.

I'm also not convinced this issue is a reason to throw out the entire current tokenizer. It seems it should be possible to write a local fix for this issue.

It seems it should be possible to write a local fix for this issue.

Correct. I'm doing that today hopefully.

it might be even harder to use the package as they'd need a Zig compiler.

That's correct. An alternative would be to do a naïve Python port of the logic, making it much simpler, but without using any regex at all. Have you looked into that at all? Do you think it would be a major performance hit?

I understand the backtracking problem now. Compare these two:

https://regex101.com/r/SPaNxh/1

vs.

https://regex101.com/r/o24Ekt/1

The first one is linear time; the second one is exponential in the number of \" and \N{ sequences present. The only difference is that the second one excludes matches that end in {{. I'll try to move the {{ exclusion logic into Python instead, which should fix the bug.

An alternative would be to do a naïve Python port of the logic, making it much simpler, but without using any regex at all. Have you looked into that at all? Do you think it would be a major performance hit?

I haven't tried that. I don't have a good intuition as to what the performance impact would be.

Alright. I'll do a simple port and benchmark it over the weekend.

The plan for the fix right now is:

  • Change the regex to get rid of the { at the end (a quick timing comparison of the old and new patterns follows this list):

     # beginning of a single quoted f-string. must not end with `{{` or `\N{`
    -SingleLbrace = r"(?:\\N{|{{|\\'|[^\n'{])*(?<!\\N)({)(?!{)"
    -DoubleLbrace = r'(?:\\N{|{{|\\"|[^\n"{])*(?<!\\N)({)(?!{)'
    +SingleLbrace = r"(?:\\N{|{{|\\'|[^\n'{])*(?<!\\N)"
    +DoubleLbrace = r'(?:\\N{|{{|\\"|[^\n"{])*(?<!\\N)'
     
     # beginning of a triple quoted f-string. must not end with `{{` or `\N{`
    -Single3Lbrace = r"(?:\\N{|{{|\\'|'(?!'')|[^'{])*(?<!\\N){(?!{)"
    -Double3Lbrace = r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N){(?!{)'
    +Single3Lbrace = r"(?:\\N{|{{|\\'|'(?!'')|[^'{])*(?<!\\N)"
    +Double3Lbrace = r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N)'
  • Add a check wherever we use the token to find an FSTRING_MIDDLE: if the match is followed by a {{, save some state and look for the next possible match just after the {{ instead.
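As a quick sanity check of the first step, here is a sketch comparing the old and proposed Double3Lbrace patterns from the diff above (the timing harness itself is mine, not part of the plan); the trimmed pattern gets through the pathological input in linear time, while the old one blows up:

import re
import time

# Old vs proposed Double3Lbrace, copied from the diff above.
old = re.compile(r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N){(?!{)')
proposed = re.compile(r'(?:\\N{|{{|\\"|"(?!"")|[^"{])*(?<!\\N)')

# Escaped quotes with no replacement field and no closing quotes.
line = '\\" ' * 22 + "\n"

for name, pattern in (("old", old), ("proposed", proposed)):
    start = time.perf_counter()
    pattern.match(line)
    print(f"{name}: {time.perf_counter() - start:.4f}s")

The {(?!{) that gets dropped here is what the second bullet then reintroduces on the Python side, skipping past any {{ it runs into.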

Okay, I tried this, but the state handling is too complicated and hard to work with right now.

In other news, my Python 3.12 tokenizer rewrite is currently passing >99.5% of black's test suite, so that's probably good?

I tried running my tokenizer on every single Python file in the dependency tree of one of my projects, and it accurately tokenized 98% of the files (3849 / 3902):

[screenshot of the comparison run's output]

And this will probably have zero dependencies and be only some 400 lines of pure Python code.


Tokenizer parity is at 100%.

I have put up a rough version here:

https://github.com/tusharsadhwani/pytokens

This is a quick port from Zig to Python, and I may have made small errors. I'll check tomorrow and try to throw it into Black.