Converting non-English anchor tags leads to "-x" values (or umlauts are replaced)
HardMax71 opened this issue · 2 comments
Describe the bug
When converting non-English UTF-8 text with anchor tags to HTML, the generated header ids are "-1", "-2", ... instead of either an error being thrown or the id being generated in Russian. In the German case (not tested on other languages with umlauts), umlauts (ä, ö, ü, ...) are replaced by their base letters (a, o, u, ...) in the ids.
To Reproduce
With english text:
# main.py
import markdown2
help_text = '''
# Header
## Table of Contents
1. [Getting Started](#getting-started)
### Getting Started {#}
To begin using the application, launch `main.py`.
'''
help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)
Result (all ok):
<h1 id="header">Header</h1>
<h2 id="table-of-contents">Table of Contents</h2>
<ol>
<li><a href="#getting-started">Getting Started</a></li>
</ol>
<h3 id="getting-started">Getting Started {#}</h3>
<p>To begin using the application, launch <code>main.py</code>.</p>
With Russian text, encoding - UTF-8:
import markdown2
help_text = '''
# Руководство
## Содержание
1. [Начало работы](#начало-работы)
### Начало работы {#}
Для начала работы запустите `main.py`.
'''
help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)
Output (the ids are somehow "-1", "-2", ...):
<h1 id="-1">Руководство</h1>
<h2 id="-2">Содержание</h2>
<ol>
<li><a href="#начало-работы">Начало работы</a></li>
</ol>
<h3 id="-3">Начало работы {#}</h3>
<p>Для начала работы запустите <code>main.py</code>.</p>
With German text, encoding - UTF-8 (umlauts replaced in ids):
<I had to change the text a bit because the translation of the text above doesn't contain any umlauts>
import markdown2
help_text = '''
## Handbuch
## Inhalt
1. [ü-umlaut-test-encoding](#ü-umlaut-test-encoding)
### ü-umlaut-test-encoding {#}
Führen Sie `main.py` aus, um loszulegen.
'''
help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)
Output:
<h2 id="handbuch">Handbuch</h2>
<h2 id="inhalt">Inhalt</h2>
<ol>
<li><a href="#ü-umlaut-test-encoding">ü-umlaut-test-encoding</a></li>
</ol>
<h3 id="u-umlaut-test-encoding">ü-umlaut-test-encoding {#}</h3>
<p>Führen Sie <code>main.py</code> aus, um loszulegen.</p>
Expected behavior
If only ASCII is supported, I would like to see an error thrown with a description along the lines of "Unsupported character at position XYZ". I would also expect a warning and/or error in the German case, where "ü" is preserved in the text and in the link (#ü-umlaut-test-encoding), BUT not in the id: <h3 id="**u**-umlaut-test-encoding">
Debug info
markdown2 version = 2.4.10
Any extras being used: 'header-ids'
Seems to be a problem in the _slugify function, where we encode all chars as ASCII and ignore all errors.
python-markdown2/lib/markdown2.py
Line 2846 in 958eea4
Git blame shows this was last touched April 2012, so I guess this was a compatibility limitation at the time? The wiki page also explicitly says header IDs are ASCII.
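To illustrate how an ASCII-only slugify with ignored encoding errors could produce both symptoms from the report, here is a minimal standalone sketch. The names `ascii_slugify` and `make_header_ids` are hypothetical, and the "-N" suffix rule for empty or repeated slugs is an assumption modelled on the observed output, not a copy of the library's code.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical helpers, written only to mirror the behaviour described in this issue.
_STRIP_RE = re.compile(r"[^\w\s-]")
_HYPHENATE_RE = re.compile(r"[-\s]+")


def ascii_slugify(text):
    # NFKD splits "ü" into "u" plus a combining diaeresis. Encoding to ASCII
    # with errors="ignore" then silently drops every non-ASCII code point.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = _STRIP_RE.sub("", ascii_text).strip().lower()
    return _HYPHENATE_RE.sub("-", cleaned)


def make_header_ids(headers):
    # Assumed rule: an empty or repeated slug gets a "-N" counter suffix.
    seen = Counter()
    ids = []
    for header in headers:
        slug = ascii_slugify(header)
        seen[slug] += 1
        if not slug or seen[slug] > 1:
            slug = f"{slug}-{seen[slug]}"
        ids.append(slug)
    return ids


print(make_header_ids(["Header", "Table of Contents", "Getting Started"]))
print(make_header_ids(["Руководство", "Содержание", "Начало работы"]))
print(make_header_ids(["ü-umlaut-test-encoding"]))
```

Under these assumptions, the Cyrillic headers all slugify to an empty string and only the counter survives, while "ü" loses its diaeresis and becomes a plain "u".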
@nicholasserra can you see any issues with bumping this up to utf-8?
I have no idea what the effects might be. If we switch and do some proper testing I don't see any reason not to.
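As a starting point for that testing, a Unicode-preserving variant might look roughly like the sketch below. `unicode_slugify` is a hypothetical name and this is not a proposed patch; it only assumes that dropping the ASCII encode step and relying on Python 3's Unicode-aware `\w` would be enough to keep non-ASCII letters.

```python
import re
import unicodedata

# Hypothetical sketch; str patterns in Python 3 match Unicode word characters by default.
_STRIP_RE = re.compile(r"[^\w\s-]")
_HYPHENATE_RE = re.compile(r"[-\s]+")


def unicode_slugify(text):
    # NFKC keeps letters such as "ü" composed instead of splitting off the accent.
    normalized = unicodedata.normalize("NFKC", text)
    cleaned = _STRIP_RE.sub("", normalized).strip().lower()
    return _HYPHENATE_RE.sub("-", cleaned)


for header in ["Getting Started", "Начало работы", "ü-umlaut-test-encoding"]:
    print(unicode_slugify(header))
```

With this version the ids would match the hand-written links in the examples above (#начало-работы, #ü-umlaut-test-encoding). Whether existing documents rely on the current ASCII ids is a separate compatibility question that the sketch does not address.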