Exponential runtime for [link](url)
xrat opened this issue ยท 8 comments
(Apparently, I cannot add labels. Would have added: reader:markdown + performance.)
I ran into a case of exponential runtime for the Markdown reader that baffles me. Sadly, I was still not able to reproduce the bug with generic data. I cannot make my pathological file publicly available but I can share it privately upon request. The type of input is
[dir1/foo_bar_1100.md](https://www.example.org/dir1/foo/bar1)
[dir2/foo_bar_2100.md](https://www.example.org/dir2/foo/bar2)
...
The exponential runtime is obvious:
$ N=0; while ((N++<10)); do n=$((10*N)); head -n $n pathological.md > tmp.md;
echo -n "$n lines: "; start=${EPOCHREALTIME//[.,]};
pandoc --to native tmp.md >/dev/null;
echo $(((${EPOCHREALTIME//[.,]}-start)/1000)) ms ; done
10 lines: 24 ms
20 lines: 41 ms
30 lines: 45 ms
40 lines: 34 ms
50 lines: 34 ms
60 lines: 50 ms
70 lines: 286 ms
80 lines: 1838 ms
90 lines: 3549 ms
100 lines: 5158 ms
At this stage (following above code) tmp.md
has 100 lines of pathological.md
. I tried to run --trace
and I notice that the output seems broken:
$ time pandoc --trace --to native tmp.md | wc -l
[trace] Parsed [Para [Link ("",[],[]) [Str "2factor_authentication.md"] ("h at line 103
703
real 0m5.182s
user 0m5.122s
sys 0m0.061s
The Commonmark reader does not have this problem:
$ time pandoc --from commonmark_x --to native tmp.md | wc -l
703
real 0m0.088s
user 0m0.082s
sys 0m0.009s
Baffling, too, is the effect of s/^/* /
:
$ sed -i 's/^/* /' tmp.md; wc -l tmp.md
100 tmp.md
$ time pandoc --to native tmp.md | wc -l
1005
real 0m0.088s
user 0m0.063s
sys 0m0.024s
Pandoc 3.1.13 on amd64 Debian GNU/Linux 11.9
I can't do much without the file. You can email it to me; my email address is in pandoc.cabal in the top level of this repository.
Thanks, sent you the file.
In the meantime, I got a little closer to what might be the trigger, but I still fail to reproduce it with generic data.
- Wrapping the link text in `` also makes Pandoc parse the data just fine.
pandoc --from markdown-citations
also runs as fast as usually.- All other extensions make no difference, except
intraword_underscores
which seemingly makes Pandoc run forever. I ^C'ed it after 15s. But that might be due to another issue.
I think the problem is the URLs with double __
in them. If you take those out, it goes quite quickly. Now that I've seen that I should be able to narrow in on a fix.
Note: adding <..>
around your URLs may also work.
Here's a minimal case that takes 9s to parse on my machine:
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)
[x_x_x__x_x_x.x3x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x3x.x)
[x_x_x_x__x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x_x__x_x.x)
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)
[x_x_x__x_x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x)
[x_x_x__x_x_x.x3x.x](x://x.x.x.x/~x/x/x/x/x_x_x__x_x_x.x3x.x)
[x_x_x_x__x_x.x](x://x.x.x.x/~x/x/x/x/x_x_x_x__x_x.x)
For reference, your minimal case takes 12s on my machine.
Thanks to your analysis I now found a minimal case that is way simpler: [foo__bar](url)
$ N=0; while ((N++<9)); do n=$((2*N)); for ((i=1;i<=n;i++)); do echo "[foo__bar](url)" ; done > tmp.md; start=${EPOCHREALTIME//[.,]}; echo -n "$n lines: "; pandoc tmp.md >/dev/null; echo $(((${EPOCHREALTIME//[.,]}-start)/1000)) ms ; done
2 lines: 24 ms
4 lines: 11 ms
6 lines: 34 ms
8 lines: 64 ms
10 lines: 154 ms
12 lines: 484 ms
14 lines: 1868 ms
16 lines: 7114 ms
18 lines: 28634 ms
So, 18 lines of [foo__bar](url)
take ~28s on my machine.
Substituting *
for _
gives the same result. This must have something to do with emphasis parsing, but I'm still puzzled.
More data points in the hope they are of use for the debugging:
[foo_bar_foobar](url)
: linear time[foo__bar__foobar](url)
: exponential time[foo___bar___foobar](url)
: linear time* [foo__bar__foobar](url)
: linear time[__foo__](url)
: linear time[foo__bar__](url)
: exponential time[__foo__bar](url)
: exponential time
Commenting out cite
in
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Markdown.hs#L1527
fixes the issue. That's where the problem lies.
The link parser grabs a string between [..]
and parses it, but the cite parser just ploughs ahead parsing inlines after [
, not stopping when it sees ]
but allowing that as a constituent of an emphasized inline. Culprit is the parser normalCite
.
Not sure about the exact fix yet but at least I now see the problem.