Knagis/CommonMark.NET

Non-ASCII characters (?) are not preserved by `cmark.exe`

Closed this issue · 4 comments

I ran `cmark Test.md --out Test.html`, where `Test.md` contains the following:

# Header – Something Else

Was the en dash character preserved or was it replaced with the Unicode replacement character?

In Test.html, the en dash character (in the header) is replaced with the Unicode replacement character (EF BF BD as hex).

Does your file have a UTF-8 preamble (byte order mark)? Without it, I guess the encoding is detected as ASCII and thus the dash character is not recognized.
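For anyone wanting to check this themselves, here is a minimal sketch (in Python rather than .NET, purely for illustration) of testing whether a file's bytes start with the UTF-8 byte order mark. The helper name `has_utf8_bom` is hypothetical and not part of CommonMark.NET.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# The sample line from this issue; \u2013 is the en dash.
sample = "# Header \u2013 Something Else"

with_bom = codecs.BOM_UTF8 + sample.encode("utf-8")
without_bom = sample.encode("utf-8")

print(has_utf8_bom(with_bom))     # file saved as "UTF-8 with BOM"
print(has_utf8_bom(without_bom))  # file saved as plain UTF-8
```

A missing BOM does not by itself mean the file is not UTF-8; it only means the reader has to guess or be told the encoding.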

@Knagis I am guessing it does not. Notepad++ seems to think it's encoded in "ANSI", though it displays the en dash character fine.

I was going to ask "Is there any reason not to just read everything as UTF-8, given that it's backwards compatible with ASCII/ANSI?" but I tested the relevant CommonMark.NET code in LINQPad and it already defaults to UTF-8.

I figured out that this is failing because the file is ISO-8859-1 encoded, the `Encoding.Default` encoding on my computer.
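A rough reproduction of the failure, sketched in Python rather than .NET. It assumes the "ANSI" file was actually Windows-1252 (`cp1252`), since that code page has an en dash at byte 0x96 while strict ISO-8859-1 has no en dash at all. Reading those bytes as UTF-8 with replacement on errors yields the same EF BF BD bytes reported above.

```python
# The sample line from this issue; \u2013 is the en dash.
line = "# Header \u2013 Something Else"

# Assumption: the "ANSI" file is Windows-1252, where the en dash is the single byte 0x96.
ansi_bytes = line.encode("cp1252")

# Decoding as UTF-8: 0x96 is a stray continuation byte, so it is invalid
# and gets replaced with U+FFFD (the Unicode replacement character).
decoded = ansi_bytes.decode("utf-8", errors="replace")

# U+FFFD written back out as UTF-8 is the byte sequence seen in Test.html.
replacement_hex = "\ufffd".encode("utf-8").hex(" ").upper()

print(ansi_bytes)
print(decoded)
print(replacement_hex)
```

This also shows why "UTF-8 is backwards compatible" only holds for the ASCII range: any byte above 0x7F in an ANSI file is not valid UTF-8 on its own.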

I changed the encoding of my file to UTF-8 but in the HTML file created by cmark, the en dash is output as:

â€“

This seems to be because cmark doesn't output a full valid HTML document. After I added a character encoding declaration, it rendered fine.
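A small sketch of what is probably happening here, assuming the browser fell back to Windows-1252 when the page declared no encoding: the three UTF-8 bytes of the en dash are each shown as a separate character.

```python
# UTF-8 encoding of the en dash (U+2013) is three bytes: E2 80 93.
utf8_bytes = "\u2013".encode("utf-8")

# Assumption: with no charset declared, the bytes are read as Windows-1252.
# E2 -> U+00E2, 80 -> U+20AC (euro sign), 93 -> U+201C (left double quote).
misread = utf8_bytes.decode("cp1252")

# Read with the correct encoding, the same bytes give the en dash back.
correct = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(" "))
print([hex(ord(ch)) for ch in misread])
print(correct == "\u2013")
```

Declaring the encoding in the document, for example with `<meta charset="utf-8">` in the `<head>`, tells the browser to take the second path.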

Would you be interested in a PR to extend the console app to support outputting full valid HTML documents?

> Would you be interested in a PR to extend the console app to support outputting full valid HTML documents?

I think that it should already support this if you create two small .html files, one for the start of the file and one for the end, and then run:

cmark.exe header.html input.md footer.html --out result.html
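As an illustration of what the two wrapper files might contain, here is a minimal Python sketch that writes them. The HTML content, the `<title>` text, and the helper name `write_wrappers` are assumptions for the example, not something shipped with CommonMark.NET. The header ends with a blank line as a precaution, on the assumption that the inputs are simply joined, so the Markdown that follows is not absorbed into the preceding HTML block.

```python
from pathlib import Path

# Hypothetical header: opens the document and declares UTF-8.
# The trailing blank line ends the HTML block before the Markdown starts.
HEADER = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head>\n"
    '<meta charset="utf-8">\n'
    "<title>Test</title>\n"
    "</head>\n"
    "<body>\n"
    "\n"
)

# Hypothetical footer: a leading blank line, then the closing tags.
FOOTER = "\n</body>\n</html>\n"


def write_wrappers(directory: Path) -> None:
    """Write header.html and footer.html as UTF-8 into the given directory."""
    (directory / "header.html").write_text(HEADER, encoding="utf-8")
    (directory / "footer.html").write_text(FOOTER, encoding="utf-8")


if __name__ == "__main__":
    write_wrappers(Path("."))
```

After running it, the `cmark.exe header.html input.md footer.html --out result.html` command above should have both files available in the current directory.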

@Knagis Ahhhhh – now it makes sense why it combines all of the output into a single file! HTML is interpreted 'raw' (unless some text is not inside an element) and so is essentially just passed through. Thanks!