TOC generation clobbers int'l characters in output anchors
davidfstr opened this issue · 9 comments
Repro by running the following commands at a shell:
$ ./markdown -V
markdown: discount 2.1.8 DL=BOTH GITHUB-TAGS FENCED-CODE
$ ./markdown -f toc
# 中
## 学
####私
^D
<p>#</p>
<a name="L..."></a>
<h2>学</h2>
<a name="L..."></a>
<h4>私</h4>
Notice that the output anchor tags have names where the original international characters are clobbered. In this case it causes all of the anchors to have the same name, since international characters make up 100% of the header names.
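To illustrate the failure mode (a hypothetical sketch, not Discount's actual code): any anchor generator that simply drops characters outside a small ASCII-safe set will map every all-CJK header to the same empty slug, producing the colliding anchors seen above.

```python
def naive_anchor(header: str) -> str:
    """Hypothetical slug generator: keep ASCII letters/digits,
    map spaces to '.', and silently drop everything else --
    roughly what a namespace-restricted anchor scheme ends up doing."""
    out = []
    for ch in header:
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        elif ch == " ":
            out.append(".")
        # non-ASCII characters are clobbered (dropped entirely)
    return "".join(out)

# All three repro headers collapse to the same (empty) anchor name:
print(naive_anchor("中"))  # ""
print(naive_anchor("学"))  # ""
print(naive_anchor("私"))  # ""
```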
This issue originates downstream from davidfstr/rdiscount#129 .
Ugh. The whole toc code is a bit of a dog's breakfast; it's going to need to be substantially rewritten to deal with this feature.
Isn't it great to have actual customers finding bugs? :-)
Heh.
FYI: Issue is still present in Discount 2.2.0. (I thought I'd check to see whether it was fixed since 2.1.8.)
I keep looking in at the TOC code, shuddering with horror, and looking away. Sorry :-(
For scary code areas like that, I typically write a bunch of test cases to pin down the current behavior. Once I have enough of those, I don't feel as concerned about making changes.
Test cases won't really help, since the current output is basically broken. The code I was given generates the id= fields on the fly and doesn't store them, so any collisions from collapsing utf-8 into a-zA-Z0-9{+punctuation} aren't noticed.

The paragraph tree needs to be modified to contain an id field that can be populated with collapsed headers, and then when I generate a new header I need to walk the tree to ensure that I'm not colliding. (Or I need to just discard any attempt to do meaningful ids from headers and just do numeric labels with a prefix. Given that xhtml is utf-8 hostile (16 million glyphs out there and we can use 70 of 'em?) attempting meaningful labels is probably a horrible horrible dead end, but it bothers me to go from id="something.awful" to id="toc.42".)
And this is where I start shuddering with horror.
A suggested strategy:
- With headers that are mostly ASCII, it seems that you could get reasonable output with an algorithm that matches or approximates the current one.
- And for headers that are mostly non-ASCII Unicode, you could extend the algorithm to add a counting suffix when a collision is detected.
This strategy preserves current behavior for ASCII-only headers, gives reasonable output for mostly-ASCII headers, and produces something workable for all other headers.
Also worth considering: It would be undesirable to change the behavior of ASCII-only headers, since it is likely that folks are relying on older URLs which are generated with the original algorithm.
FINALLY fixed by tweaking how I escape out-of-namespace characters. (Normal behavior changes out-of-namespace characters to -XX- hex, changes space characters to -, and hexifies -; urlencodedanchor (in the throes of being renamed to html5anchor) changes space to - and hexifies out-of-namespace characters plus '%' to %XX hex sequences.)

It slightly breaks backwards compatibility (because of mapping space to - instead of .) but it should make non-ascii anchors less horribly collidy.
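A rough model of the %XX scheme described above (the exact safe character set here is an assumption, and `html5_anchor` is an illustrative name, not Discount's function): spaces become `-`, and every other out-of-namespace byte of the UTF-8 encoding is percent-escaped, so distinct non-ASCII headers now get distinct anchors.

```python
def html5_anchor(header: str) -> str:
    """Sketch of percent-escaping anchors: space -> '-', bytes outside
    an assumed ASCII-safe set (and '%' itself) -> %XX hex sequences."""
    safe = set("abcdefghijklmnopqrstuvwxyz"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "0123456789-_.")  # assumed safe set; '%' is excluded
    out = []
    for b in header.encode("utf-8"):
        ch = chr(b)
        if ch == " ":
            out.append("-")
        elif ch in safe:
            out.append(ch)
        else:
            out.append("%%%02X" % b)  # escape each raw UTF-8 byte
    return "".join(out)

# Distinct CJK headers now yield distinct, reversible anchor names:
print(html5_anchor("中"))           # "%E4%B8%AD"
print(html5_anchor("学"))           # "%E5%AD%A6"
print(html5_anchor("hello world"))  # "hello-world"
```

Because the escaping is per-byte and lossless, two different headers can no longer collapse to the same anchor the way the old a-zA-Z0-9 collapsing did.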