Knagis/CommonMark.NET

How to implement Markdown subset similar to StackOverflow's mini-Markdown

Opened this issue · 2 comments

What is the easiest way to implement something similar to the StackOverflow comments section? They refer to it as 'mini-Markdown': only italic, bold and code are allowed, in other words a whitelist of Markdown tags. Everything else, including HTML and other MD tags, should be displayed as is (passed through or HtmlEncoded) in order to avoid XSS and to match specific business requirements.

Basically I need to let my users mark some of their text as bold or italic. I would also allow paragraphs and lists. But everything else (quotations, links, images, headings, any HTML) should be preserved and displayed AS IS (HTML encoded, because it will be rendered within a bigger HTML page). Essentially I'm inventing my own super strict and limited subset of Markdown, let's call it MarkdownSlim. I want to implement it with CommonMark.NET because I may need to extend it easily in the future (allow more MD tags).

I cannot simply pass the input through CommonMarkConverter.Convert because it will also convert MD tags that I don't support into HTML, so they would be displayed differently from how they were entered.

Would this be the right approach? I tried it, but it needs more debugging and learning on my part, since it does not seem to preserve all of the input.

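protected override void WriteBlock(
    Block block,
    bool isOpening,
    bool isClosing,
    out bool ignoreChildNodes) {

    if (block.Tag == BlockTag.List || block.Tag == _OTHER_TAGS_ALLOWED_BY_MARKDOWNSLIM_ ) {

        base.WriteBlock(block, isOpening, isClosing, out ignoreChildNodes);

    } else {

        // Not whitelisted: write the block's own text HTML-encoded and keep visiting children.
        ignoreChildNodes = false;
        // Container nodes are visited on opening and on closing; write only once.
        if (isOpening && block.StringContent != null) {
            this.Write(AntiXss.HtmlEncode(block.StringContent.ToString()));
        }
    }
}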

protected override void WriteInline(
    Inline inline, 
    bool isOpening, 
    bool isClosing, 
    out bool ignoreChildNodes) {

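    if (inline.Tag == InlineTag.Emphasis 
              || inline.Tag == _OTHER_TAGS_ALLOWED_BY_MARKDOWNSLIM_ ) {

        base.WriteInline(inline, isOpening, isClosing, out ignoreChildNodes);

    } else {

        // Not whitelisted: write the literal text HTML-encoded and keep visiting children.
        ignoreChildNodes = false;
        // Inlines with children are visited on opening and on closing; write only once.
        if (isOpening && inline.LiteralContent != null) {
            this.Write(AntiXss.HtmlEncode(inline.LiteralContent));
        }
    }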
}

I feel like this is a very common use case, but I could not find a good example and I'm not sure I'm even on the right track. There seems to be a LOT OF INTEREST in implementing 'safe markdown', and I think it should boil down to being able to easily implement subsets of Markdown like the one I've described. Maybe a good example on a wiki?

Yes, I would probably do something like this: create a custom renderer that only renders the markup that you allow. This would render some unsupported things like lists or headings as plain text while removing the markdown specifics. You can also re-run the parser on old markdown inputs once you extend the list of supported things.
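A minimal sketch of how such a custom renderer could be wired into the converter, assuming the HtmlFormatter base class and the OutputDelegate setting that CommonMark.NET exposes. The names MarkdownSlimFormatter and MarkdownSlim.Render, as well as the particular whitelist, are hypothetical and only illustrate the idea:

```csharp
using System.IO;
using System.Net;
using CommonMark;
using CommonMark.Formatters;
using CommonMark.Syntax;

// Hypothetical formatter: the class name and the whitelist are illustrative only.
public class MarkdownSlimFormatter : HtmlFormatter
{
    public MarkdownSlimFormatter(TextWriter target, CommonMarkSettings settings)
        : base(target, settings) { }

    protected override void WriteInline(
        Inline inline, bool isOpening, bool isClosing, out bool ignoreChildNodes)
    {
        switch (inline.Tag)
        {
            // Plain text, breaks and the whitelisted inlines use the default HTML output.
            case InlineTag.String:
            case InlineTag.SoftBreak:
            case InlineTag.LineBreak:
            case InlineTag.Emphasis:
            case InlineTag.Strong:
            case InlineTag.Code:
                base.WriteInline(inline, isOpening, isClosing, out ignoreChildNodes);
                break;

            // Inline HTML is written as encoded text, never as markup.
            case InlineTag.RawHtml:
                ignoreChildNodes = false;
                if (isOpening) this.Write(WebUtility.HtmlEncode(inline.LiteralContent));
                break;

            // Anything else (links, images, ...) emits no tags; its child text is still visited.
            default:
                ignoreChildNodes = false;
                break;
        }
    }
}

public static class MarkdownSlim
{
    public static string Render(string markdown)
    {
        // Clone the defaults so the shared settings instance is not modified.
        var settings = CommonMarkSettings.Default.Clone();
        settings.OutputDelegate = (doc, output, s) =>
            new MarkdownSlimFormatter(output, s).WriteDocument(doc);
        return CommonMarkConverter.Convert(markdown, settings);
    }
}
```

This sketch filters inlines only. Block-level elements (headings, block quotes, HTML blocks) still go through the default WriteBlock, so a real MarkdownSlim renderer would need the same kind of whitelist there. It also drops the markdown syntax of unsupported inlines rather than reproducing it as entered.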

As for creating "safe output": I think it should be possible to just run the whole generated HTML output through an XSS encoder so that it encodes everything. The problem is finding a good library for that. Some time ago there wasn't anything in the .NET world (only one older library from Microsoft, which no longer supported the sanitization option). A quick search now turned up this one, which seems promising: https://github.com/mganss/HtmlSanitizer/
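As a rough illustration of that idea, the converter output can be passed through HtmlSanitizer with a restricted tag list. This is a sketch under assumptions: SafeMarkdown.Render is a hypothetical helper, the tag list is only an example, and the namespace shown is the one used by recent HtmlSanitizer releases (older ones used Ganss.XSS):

```csharp
using CommonMark;
using Ganss.Xss; // assumption: older HtmlSanitizer releases use the namespace Ganss.XSS instead

public static class SafeMarkdown
{
    public static string Render(string markdown)
    {
        // Full CommonMark conversion first; raw HTML in the input is passed through here.
        string html = CommonMarkConverter.Convert(markdown);

        var sanitizer = new HtmlSanitizer();

        // Replace the default whitelist with a small, example set of tags.
        sanitizer.AllowedTags.Clear();
        foreach (var tag in new[] { "p", "em", "strong", "code", "ul", "ol", "li" })
        {
            sanitizer.AllowedTags.Add(tag);
        }

        // None of the allowed tags needs attributes in this example.
        sanitizer.AllowedAttributes.Clear();

        return sanitizer.Sanitize(html);
    }
}
```

Note that the sanitizer removes disallowed elements instead of showing them as encoded text. That addresses the XSS concern, but not the requirement to display unsupported input exactly as it was entered, which still calls for a custom renderer.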

I use HtmlSanitizer for cleaning inputs where I have to allow HTML, and I've never really had problems with it. It has some nice extensibility points that allow you to customize how you want to handle unwanted/disallowed inputs.

One caveat is that HtmlSanitizer recently switched to AngleSharp (from CsQuery) for parsing the entered markup, and AngleSharp seems to like introducing random breaking changes into its API (just something to be aware of). HtmlSanitizer currently handles that problem by pinning its dependency to a specific version of AngleSharp that they can test against.

Just FYI.