commonmark/cmark

Quadratic behavior when parsing emphasis

nwellnhof opened this issue · 0 comments

Parsing a *t sequence followed by a _t*_ sequence exhibits quadratic behavior:

$ for n in 5000 10000 20000; do python3 -c "print('*t '*$n+'_t*_ '*$n)" |time -f "$n elems: %e secs" build/src/cmark >/dev/null; done
5000 elems: 0.19 secs
10000 elems: 0.88 secs
20000 elems: 5.45 secs

Sample pattern

*t *t *t *t _t*_ _t*_ _t*_ _t*_

Analysis

  • After matching a * closer, the preceding _ opener is removed.
  • After parsing a _ closer, no opener is found. The corresponding openers_bottom entry is set to preceding * opener.
  • After matching the next * closer, the corresponding opener is removed. This is the entry in openers_bottom.
  • The entry in openers_bottom becomes a dangling pointer.
  • Subsequent searches for an opener for _ closers will test against the dangling pointer (which is undefined behavior).
  • Since the opener was removed, there will never be a match.
  • The openers_bottom optimization stops working, resulting in quadratic behavior with abusive input patterns.

Possible solutions

  • Adjust openers_bottom when removing delimiters (somewhat expensive).
  • Number the delimiters and store the delimiter position instead of a pointer in openers_bottom.