Knagis/CommonMark.NET

PlainText and Rtf converter

Closed this issue · 13 comments

I've been looking for days for a library that can handle html, markdown and rtf.
It would be really cool if you implemented at least a markdown to plaintext converter.
Maybe it's not so hard to write it myself, but I think the point of a library is to be functional and reusable.

I've written methods to convert markdown into plain text before. They use CommonMark.NET to get the syntax tree, then loop through the tree and collect the literal content. You still have to write a method to strip the html wherever it is encountered, though. See example:

var document = CommonMarkConverter.Parse(markdown);

var sb = new StringBuilder();
foreach (var node in document.AsEnumerable().Where(node => node.IsOpening))
{
    if (node.Inline != null)
    {
        if (node.Inline.Tag == InlineTag.String)
        {
            sb.AppendFormat("{0} ", node.Inline.LiteralContent);
        }
        else if (node.Inline.Tag == InlineTag.RawHtml)
        {
            sb.AppendFormat("{0} ", ParseHtml(node.Inline.LiteralContent));
        }
    }
    else if (node.Block != null && node.Block.Tag == BlockTag.HtmlBlock)
    {
        sb.AppendFormat("{0} ", ParseHtml(node.Block.StringContent.ToString()));
    }
}

// The loop above makes no attempt to preserve whitespace, so collapse runs of
// whitespace and remove the stray space before commas and periods.
return Regex.Replace(sb.ToString(), "\\s+", " ").Trim().Replace(" ,", ",").Replace(" .", ".");

This is sufficient for my purposes, as I only need a snippet of the original content, so I don't care about things like whitespace. It should give you somewhere to start from.

Thank you, but I don't want to implement it myself. A library should just do it for me, because a few months later someone else will come along with the same problem and will also write it themselves. I prefer to avoid redundancy. In my opinion, CommonMark.NET is the right place for such a function.

How do you want to use the plain text output?

We have our own internal software that works with Rtf or PlainText. I have to sync JIRA with our software. JIRA uses markdown, so I have to convert markdown to Rtf or PlainText.
Currently I convert markdown to html with CommonMark.NET and then convert the html to PlainText with regex. I replace <p>, <div>, </br>, etc. with a line break and remove everything else.
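A minimal sketch of that regex approach, using only the .NET base class library. The class name, the method name `HtmlToPlainText`, and the exact list of tags treated as line breaks are assumptions made for illustration, and regex-based stripping is only reasonable for simple, well-formed html such as the output of a markdown converter:

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlToTextSketch
{
    // Hypothetical helper, not part of CommonMark.NET.
    public static string HtmlToPlainText(string html)
    {
        // Turn <br> and closing block-level tags into line breaks.
        string text = Regex.Replace(
            html,
            @"<br\s*/?>|</(p|div|li|h[1-6])\s*>",
            "\n",
            RegexOptions.IgnoreCase);

        // Remove every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);

        // Decode entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse consecutive line breaks into a single one.
        text = Regex.Replace(text, @"\n{2,}", "\n");

        return text.Trim();
    }
}
```

The html would come from `CommonMarkConverter.Convert(markdown)`; the result keeps one line per paragraph or list item and drops all other markup.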

Why are you not just using markdown as the plain text? It is the main purpose of markdown - to be easily readable as plain text...

Because in some places it looks weird?
I think the better question is why I'm not just stripping the markdown down to plaintext.

To my knowledge, Jira doesn't actually use markdown; it uses wiki markup. The only Atlassian product that uses markdown is Bitbucket Server.

Are you using the Jira plugin here? If so, you should know that it is based on stack-overflow flavored markdown which means it will not be fully supported by the CommonMark standard and this library.

I know, but the wiki markup from JIRA is very similar.
But whatever, my current problem has nothing to do with the idea itself.

Because in some places it looks weird?

Can you give an example and what you would like in your output instead?

Something like this.
Markdown Code:

### Header
**Some bold text**  
A [link](https://github.com/Knagis/CommonMark.NET/issues/86#issuecomment-230293997).  
* Item 1
* Item 2

> Welcome to
> blockqoute

First Header | Second Header
------------ | -------------
cell 1 |  cell 2

PlainText:

Header
Some bold text
A link(https://github.com/Knagis/CommonMark.NET/issues/86#issuecomment-230293997).  
- Item 1
- Item 2

"Welcome to blockqoute"

First Header | Second Header
------------ | -------------
cell 1       | cell 2

What you described here is partially a markdown formatter (pretty-printing the tables) and partially removing markdown symbols (headings, half of the links). These requirements are rather specific to your scenario.

This is not something that will be supported by CommonMark.NET out of the box. You can achieve that by implementing a custom formatter.
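As a rough sketch of what such a custom formatter could look like, the code below walks the syntax tree returned by `CommonMarkConverter.Parse` and emits plain text. The class and method names are hypothetical, and the member names used (`InlineContent`, `TargetUrl`, `InlineTag.SoftBreak`, and so on) are assumed to match the CommonMark.NET version in use. Block quotes, code blocks, images and raw html are deliberately left unhandled here:

```csharp
using System.Text;
using CommonMark;
using CommonMark.Syntax;

public static class PlainTextSketch
{
    // Hypothetical helper, not part of CommonMark.NET.
    public static string ToPlainText(string markdown)
    {
        Block document = CommonMarkConverter.Parse(markdown);
        var sb = new StringBuilder();

        foreach (var node in document.AsEnumerable())
        {
            if (node.Inline != null)
            {
                switch (node.Inline.Tag)
                {
                    case InlineTag.String:
                    case InlineTag.Code:
                        // Plain text and inline code: keep the literal text.
                        if (node.IsOpening) sb.Append(node.Inline.LiteralContent);
                        break;
                    case InlineTag.SoftBreak:
                        if (node.IsOpening) sb.Append(' ');
                        break;
                    case InlineTag.LineBreak:
                        if (node.IsOpening) sb.Append('\n');
                        break;
                    case InlineTag.Link:
                        // The link text is emitted by its child nodes;
                        // append the target once the link closes.
                        if (node.IsClosing)
                            sb.Append('(').Append(node.Inline.TargetUrl).Append(')');
                        break;
                }
            }
            else if (node.Block != null)
            {
                if (node.IsOpening && node.Block.Tag == BlockTag.ListItem)
                {
                    // Every list item gets a dash, ordered or not.
                    sb.Append("- ");
                }
                else if (node.IsClosing && node.Block.InlineContent != null)
                {
                    // Paragraphs and headings end with a line break.
                    sb.Append('\n');
                }
            }
        }

        return sb.ToString().Trim();
    }
}
```

Emphasis and heading markers disappear because only the literal text of their child nodes is written out. Aligning table columns would need separate handling.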

Have you looked at pandoc? It supports RTF and it might be possible to configure it to match your expectations.

pandoc is an external program and it works with files. I have strings and hoped I could do this without running external executables.

Closing this because, as I mentioned, the requirement is too specific to your use case. It also wouldn't solve the problem completely anyway, as CommonMark does not support tables.