jgm/pandoc

Exponential runtime for [link](url)

xrat opened this issue ยท 8 comments

xrat commented

(Apparently, I cannot add labels. Would have added: reader:markdown + performance.)

I ran into a case of exponential runtime for the Markdown reader that baffles me. Sadly, I was still not able to reproduce the bug with generic data. I cannot make my pathological file publicly available but I can share it privately upon request. The type of input is

[dir1/foo_bar_1100.md](https://www.example.org/dir1/foo/bar1)
[dir2/foo_bar_2100.md](https://www.example.org/dir2/foo/bar2)
...

The exponential runtime is obvious:

$ N=0; while ((N++<10)); do n=$((10*N)); head -n $n pathological.md > tmp.md;
echo -n "$n lines: "; start=${EPOCHREALTIME//[.,]};
pandoc --to native tmp.md >/dev/null;
echo $(((${EPOCHREALTIME//[.,]}-start)/1000)) ms ; done
10 lines: 24 ms
20 lines: 41 ms
30 lines: 45 ms
40 lines: 34 ms
50 lines: 34 ms
60 lines: 50 ms
70 lines: 286 ms
80 lines: 1838 ms
90 lines: 3549 ms
100 lines: 5158 ms

At this stage (following above code) tmp.md has 100 lines of pathological.md. I tried to run --trace and I notice that the output seems broken:

$ time pandoc --trace --to native tmp.md | wc -l
[trace] Parsed [Para [Link ("",[],[]) [Str "2factor_authentication.md"] ("h at line 103
703

real    0m5.182s
user    0m5.122s
sys     0m0.061s

The Commonmark reader does not have this problem:

$ time pandoc --from commonmark_x --to native tmp.md | wc -l
703

real    0m0.088s
user    0m0.082s
sys     0m0.009s

Baffling, too, is the effect of s/^/* /:

$ sed -i 's/^/* /' tmp.md; wc -l tmp.md
100 tmp.md
$ time pandoc --to native tmp.md | wc -l
1005

real    0m0.088s
user    0m0.063s
sys     0m0.024s

Pandoc 3.1.13 on amd64 Debian GNU/Linux 11.9

jgm commented

I can't do much without the file. You can email it to me; my email address is in pandoc.cabal in the top level of this repository.

xrat commented

Thanks, sent you the file.

In the meantime, I got a little closer to what might be the trigger, but I still fail to reproduce it with generic data.

  • Wrapping the link text in `` also makes Pandoc parse the data just fine.
  • pandoc --from markdown-citations also runs as fast as usually.
  • All other extensions make no difference, except intraword_underscores which seemingly makes Pandoc run forever. I ^C'ed it after 15s. But that might be due to another issue.
jgm commented

I think the problem is the URLs with double __ in them. If you take those out, it goes quite quickly. Now that I've seen that I should be able to narrow in on a fix.

Note: adding <..> around your URLs may also work.

jgm commented

Here's a minimal case that takes 9s to parse on my machine:

[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x3x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x3x.x)  
[x_x_x_x__x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x_x__x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)  
[x_x_x__x_x_x.x3x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x3x.x)  
[x_x_x_x__x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x_x__x_x.x)  
xrat commented

For reference, your minimal case takes 12s on my machine.

Thanks to your analysis I now found a minimal case that is way simpler: [foo__bar](url)

$ N=0; while ((N++<9)); do n=$((2*N)); for ((i=1;i<=n;i++)); do echo "[foo__bar](url)" ; done > tmp.md; start=${EPOCHREALTIME//[.,]}; echo -n "$n lines: "; pandoc tmp.md >/dev/null; echo $(((${EPOCHREALTIME//[.,]}-start)/1000)) ms ; done
2 lines: 24 ms
4 lines: 11 ms
6 lines: 34 ms
8 lines: 64 ms
10 lines: 154 ms
12 lines: 484 ms
14 lines: 1868 ms
16 lines: 7114 ms
18 lines: 28634 ms

So, 18 lines of [foo__bar](url) take ~28s on my machine.

jgm commented

Substituting * for _ gives the same result. This must have something to do with emphasis parsing, but I'm still puzzled.

xrat commented

More data points in the hope they are of use for the debugging:

  • [foo_bar_foobar](url): linear time
  • [foo__bar__foobar](url): exponential time
  • [foo___bar___foobar](url): linear time
  • * [foo__bar__foobar](url): linear time
  • [__foo__](url): linear time
  • [foo__bar__](url): exponential time
  • [__foo__bar](url): exponential time
jgm commented

Commenting out cite in
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Markdown.hs#L1527
fixes the issue. That's where the problem lies.

The link parser grabs a string between [..] and parses it, but the cite parser just ploughs ahead parsing inlines after [, not stopping when it sees ] but allowing that as a constituent of an emphasized inline. Culprit is the parser normalCite.

Not sure about the exact fix yet but at least I now see the problem.