davidfstr/rdiscount

Unicode headers produce invalid anchors

mitchelltd opened this issue · 1 comment

When rdiscount processes headers to produce anchors (for use in TOC generation), it converts the UTF-8 input to ASCII-8BIT. In the process, non-ASCII characters are mangled: the é in the example below comes out as ?. in the anchor. The question mark is a reserved character in URLs, so the generated link is invalid.

Example:

irb(main):001:0> require 'rdiscount'
=> true
irb(main):002:0> test = "# Précis"
=> "# Précis"
irb(main):003:0> rd = RDiscount.new(test, :generate_toc)
=> #<RDiscount:0x007f92d2026630 @text="# Précis", @generate_toc=true>
irb(main):004:0> puts rd.toc_content
<ul>
<li><a href="#Pr?.cis">Précis</a></li>
</ul>

=> nil
irb(main):005:0> test.encoding
=> #<Encoding:UTF-8>
irb(main):006:0> (rd.toc_content).encoding
=> #<Encoding:ASCII-8BIT>
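
As a possible stopgap for the encoding tag only (it does not repair the mangled href), the returned string could be re-tagged as UTF-8. The sketch below does not call rdiscount at all: it simulates the situation with a hypothetical literal, and it assumes the bytes in toc_content are still valid UTF-8, which the correctly printed link text above suggests but does not prove.

```ruby
# Simulate UTF-8 bytes that have been tagged as binary (ASCII-8BIT),
# as toc_content appears to be in the session above.
# The literal is a hypothetical stand-in, not real rdiscount output.
toc = "<li><a href=\"#Pr?.cis\">Précis</a></li>".b

# force_encoding changes only the encoding tag; the bytes stay the same.
retagged = toc.dup.force_encoding(Encoding::UTF_8)

puts retagged.encoding          # the string is now tagged as UTF-8
puts retagged.valid_encoding?   # true only if the bytes really are UTF-8
puts retagged.include?("Précis")
```

If valid_encoding? returns false on real output, the bytes were altered and re-tagging is not safe.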

It is worth comparing this outcome with that of GitLab Flavored Markdown, which preserves Unicode characters in link IDs.
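
To illustrate what a Unicode-preserving anchor could look like, here is a minimal sketch in Ruby. It is neither rdiscount's nor GitLab's actual algorithm; unicode_anchor is a hypothetical helper, and the rules it applies (lowercase, drop punctuation, hyphenate whitespace) are assumptions chosen for the example.

```ruby
# Hypothetical helper: builds an anchor from header text while keeping
# Unicode letters and digits. Not rdiscount's or GitLab's real algorithm.
def unicode_anchor(header_text)
  header_text
    .strip
    .downcase
    .gsub(/[^\p{L}\p{N}\s_-]/, "")  # drop punctuation, keep letters and digits
    .gsub(/\s+/, "-")               # collapse whitespace runs into one hyphen
end

puts unicode_anchor("Précis")           # prints "précis"
puts unicode_anchor("Précis & Résumé")  # prints "précis-résumé"
```

With an anchor built this way, the é survives intact instead of being replaced byte by byte.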

Known issue. Tracking in #129