trentm/python-markdown2

Converting non-english anchor tags leads to "-x" values (or umlauts are replaced)

HardMax71 opened this issue · 2 comments

Describe the bug
When converting non-English UTF-8 text with header anchors to HTML, the generated header ids come out as "-1", "-2", ... instead of either raising an error or producing an id in Russian. For German (not tested with other languages that use umlauts), umlauts (ä, ö, ü, ...) are replaced with their base letters (a, o, u, ...) in the ids.

To Reproduce
With English text:

# main.py
import markdown2
help_text = '''
# Header

## Table of Contents
1. [Getting Started](#getting-started)

### Getting Started {#}
To begin using the application, launch `main.py`.
'''

help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)

Result (all ok):

<h1 id="header">Header</h1>

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
<li><a href="#getting-started">Getting Started</a></li>
</ol>

<h3 id="getting-started">Getting Started {#}</h3>

<p>To begin using the application, launch <code>main.py</code>.</p>

With Russian text (UTF-8 encoded):

import markdown2
help_text = '''
# Руководство 

## Содержание
1. [Начало работы](#начало-работы)

### Начало работы {#}
Для начала работы запустите `main.py`.
'''

help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)

Output (the ids come out as "-1", "-2", "-3"):

<h1 id="-1">Руководство</h1>

<h2 id="-2">Содержание</h2>

<ol>
<li><a href="#начало-работы">Начало работы</a></li>
</ol>

<h3 id="-3">Начало работы {#}</h3>

<p>Для начала работы запустите <code>main.py</code>.</p>

With German text (UTF-8 encoded), umlauts are replaced in the ids.
(I had to change the text a bit because the German translation of the text above doesn't contain any umlauts.)

import markdown2
help_text = '''
## Handbuch 

## Inhalt
1. [ü-umlaut-test-encoding](#ü-umlaut-test-encoding)

### ü-umlaut-test-encoding {#}
Führen Sie `main.py` aus, um loszulegen.
'''

help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)

Output:

<h2 id="handbuch">Handbuch</h2>

<h2 id="inhalt">Inhalt</h2>

<ol>
<li><a href="#ü-umlaut-test-encoding">ü-umlaut-test-encoding</a></li>
</ol>

<h3 id="u-umlaut-test-encoding">ü-umlaut-test-encoding {#}</h3>

<p>Führen Sie <code>main.py</code> aus, um loszulegen.</p>

Expected behavior
If only ASCII is supported, I would like to see an error thrown with a description along the lines of "Unsupported character at position XYZ". I would also expect a warning and/or error in the German case, where "ü" is preserved in the text and in the link (#ü-umlaut-test-encoding), BUT not in the id: <h3 id="**u**-umlaut-test-encoding">
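
To make the requested error concrete, here is a minimal sketch of such a check. `require_ascii` is a hypothetical helper, not something markdown2 provides; it only illustrates the kind of message described above.

```python
def require_ascii(text: str) -> None:
    """Raise ValueError on the first non-ASCII character in text.

    Hypothetical helper for illustration only; not part of markdown2.
    """
    for pos, ch in enumerate(text):
        if ord(ch) > 127:
            raise ValueError(f"Unsupported character {ch!r} at position {pos}")


require_ascii("Getting Started")  # passes silently
try:
    require_ascii("ü-umlaut-test-encoding")
except ValueError as exc:
    print(exc)
```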

Debug info
markdown2 version = 2.4.10

Any extras being used: 'header-ids'

Seems to be a problem in the _slugify function where we encode all chars as ascii and ignore all errors.

value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode()
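
To see what that expression does to the headers from the report, here is a small standalone sketch. `ascii_fold` is a hypothetical name that wraps the exact expression quoted above; it does not call markdown2 itself.

```python
import unicodedata


def ascii_fold(value: str) -> str:
    # NFKD splits "ü" into "u" + a combining diaeresis; encoding to ASCII
    # with errors="ignore" then silently drops every non-ASCII code point.
    return unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode()


print(repr(ascii_fold("Getting Started")))         # 'Getting Started'
print(repr(ascii_fold("ü-umlaut-test-encoding")))  # 'u-umlaut-test-encoding'
print(repr(ascii_fold("Руководство")))             # ''
```

The Cyrillic header collapses to an empty string, which would explain the "-1", "-2", "-3" ids if a uniqueness counter is then appended to the empty slug. That last step is an assumption and has not been checked against the source here.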

Git blame shows this was last touched in April 2012, so I guess this was a compatibility limitation at the time? The wiki page also explicitly says header IDs are ASCII.
@nicholasserra can you see any issues with bumping this up to utf-8?

I have no idea what the effects might be. If we switch and do some proper testing I don't see any reason not to.
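
As a starting point for that testing, here is a rough sketch of what a Unicode-preserving slug function could look like. This is not markdown2's code: `slugify_unicode` and the two regex names are hypothetical, and using NFKC normalization is an assumption. It relies on `\w` matching Unicode letters for `str` patterns in Python 3.

```python
import re
import unicodedata

# Hypothetical names; the patterns are assumptions for this sketch.
_strip_re = re.compile(r'[^\w\s-]')    # drop anything that is not a word char, space or hyphen
_hyphenate_re = re.compile(r'[-\s]+')  # collapse runs of spaces/hyphens into one hyphen


def slugify_unicode(value: str) -> str:
    # NFKC keeps precomposed letters such as "ü" intact instead of
    # decomposing them, and no ASCII encoding step throws letters away.
    value = unicodedata.normalize('NFKC', value)
    value = _strip_re.sub('', value).strip().lower()
    return _hyphenate_re.sub('-', value)


print(slugify_unicode("Getting Started {#}"))     # getting-started
print(slugify_unicode("Начало работы {#}"))       # начало-работы
print(slugify_unicode("ü-umlaut-test-encoding"))  # ü-umlaut-test-encoding
```

With this approach the generated ids would match the `#начало-работы` and `#ü-umlaut-test-encoding` links from the report. It would also change the ids of existing documents with accented headers (for example `u-...` becoming `ü-...`), so it may need to be opt-in.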