Knagis/CommonMark.NET

Syntax Highlighting for Editors

Closed this issue · 19 comments

I'm using CommonMark.NET for a Windows-based Markdown editor (https://github.com/mike-ward/Markdown-Edit). Currently, I'm using regexes for syntax highlighting the source (not the rendered output, but the source), which has lots of limitations due to the block nature of the language.

So..., I was thinking, it might be very cool to generate text segments with metadata for use with source highlighting. This could be used to correctly highlight source markdown. Is this something you would be interested in collaborating on?

I would really like to have this feature. There are some complexities of implementing it:

  • Tab to space conversion - since the positions have to represent the original data, the parser needs to keep track of where the tabs were located.
  • Indented blocks - the block parser in many cases removes leading spaces (the indentation) - that information has to be preserved, similar to the tabs.
  • Links with reference definitions - how to represent the links that consist of two separate parts.
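For the tab-to-space issue, one option is to precompute, per line, a map from expanded columns back to original character indices. This is only an illustrative sketch - the type and the approach are my own, not CommonMark.NET's actual implementation - assuming the CommonMark tab stop of 4:

```csharp
using System;
using System.Collections.Generic;

static class TabTracker
{
    // Builds a map from each column of the tab-expanded line back to the
    // index in the original source line (tab stop = 4, as in CommonMark).
    public static int[] ExpandedToOriginal(string line, int tabStop = 4)
    {
        var map = new List<int>();
        int col = 0;
        for (int i = 0; i < line.Length; i++)
        {
            if (line[i] == '\t')
            {
                // One tab expands to as many spaces as needed to reach
                // the next tab stop; all of them map back to the tab.
                int spaces = tabStop - (col % tabStop);
                for (int s = 0; s < spaces; s++) { map.Add(i); col++; }
            }
            else
            {
                map.Add(i);
                col++;
            }
        }
        return map.ToArray();
    }
}
```

For the input `"\tfoo"`, columns 0-3 all map back to the tab at index 0, and column 4 maps to the `f` at index 1.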

I think that the first step should be creating a set of tests for various scenarios (focusing on tabs and nested blocks). After that I could try to put together a solution.

Um... And here I thought it would be so easy. I could live without tabs (I always convert them to spaces). I agree about the block indentations.

Tests are a good approach. I'm close to a first release on my markdown editor. After that, I'll look into generating some tests.

This would be very useful.
I am trying to replace the custom parser (regex-based) with CommonMark.NET in WebEssentials 2013 (WebEssentials 2013 Issue 1732).
Unfortunately I am having exactly this problem.
If I find the time to implement a test suite, would you be interested in a pull request?

Thanks.

Yes, it would be really great if you could create these tests.

azu/markdown-to-ast might prove useful - they seem to have implemented this in JS. Though I don't see any ready tests in their repository that we could borrow.

I have taken a look at https://github.com/azu/markdown-to-ast.
It does seem they compute the position from the column and line.
I was thinking that a better solution would probably be to simply track the position during parsing (obviously it needs to handle tabs correctly).
I will try to put together a test suite anyway.

I have started playing with this. First results (handling only block indentation) are promising.

One thing I added while doing that was the AsEnumerable() method that is now in master since it provides an easy way of testing the results.

Another issue is that I need to refactor at least MatchEmphasisStack and MatchTildeStack, since the code there overlaps significantly and they are the most complex inlines for which to get the correct start/end positions.

Cool!
I am a little busy at the moment preparing the pull request for madskristensen/WebEssentials2013#1732.
When I am done with that I will switch to this issue and see if I can help (but you seem to have tackled this issue very fast on your own :) )

Question - is it ok if the positions tracked include the delimiters? E.g., for an inline code element, the substring represented by the positions will be

``var i = 1;``

so if you need to configure some syntax highlighting for the contents, you would need to do additional parsing to remove the wrapping backticks. The logic to do so is very simple though.

Although it would be no problem to store both positions, I don't really want to, since that would increase the memory used and thus probably impact performance.

If I read the specification correctly, three cases are possible based on how many " ` " are used in the source markdown?

`code`
``code``
```code```

The general idea is that, for the purpose of integrating the library in any text editor designed to keep memory pressure low (avoiding string allocations), the only requirement is to be able to obtain a series of TextRange(RangeStart, RangeEnd | RangeLength) values associated with each AST node.
Each node can then be mapped to a CodeArtifact that categorizes the TextRange for some language or style.
The AST need not even contain any string content at all (it could be retrieved by slicing the source buffer based on the positional information in each TextRange; Visual Studio exposes the text editor in its extensibility model exactly this way).
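The model described above could be sketched roughly like this (illustrative types only - TextRange and CodeArtifact are names from this discussion, not part of CommonMark.NET):

```csharp
using System;

// A position/length pair for one AST node, referring back to the source buffer.
public struct TextRange
{
    public int Start;
    public int Length;
    public int End => Start + Length;
    public TextRange(int start, int length) { Start = start; Length = length; }
}

// Associates a TextRange with an editor classification (e.g. "md_bold").
public class CodeArtifact
{
    public TextRange Range;
    public string ClassificationType;

    public CodeArtifact(TextRange range, string classificationType)
    {
        Range = range;
        ClassificationType = classificationType;
    }

    // The editor slices the source buffer itself; the AST need not carry
    // any string content of its own.
    public string GetText(string source) =>
        source.Substring(Range.Start, Range.Length);
}
```

For example, a `CodeArtifact` with range (2, 3) over the source `"**foo**"` classifies the substring `"foo"` without the AST ever storing that string.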

No, the number of characters is not limited; what you would need to do is to reduce the length and increase the start index for as long as the first and last characters of the block are backticks.
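That trimming can be sketched as follows (my own illustration, not library code; per the spec the closing backtick run has the same length as the opening one, so counting the opening run is enough):

```csharp
using System;

static class CodeSpanTrimmer
{
    // Given the position/length of an inline code span including its
    // delimiters, returns the position/length of the contents only.
    public static (int Start, int Length) TrimBackticks(string source, int start, int length)
    {
        // Count the opening backtick run...
        int fence = 0;
        while (fence < length && source[start + fence] == '`')
            fence++;

        // ...and strip the same number of backticks from both ends.
        return (start + fence, length - 2 * fence);
    }
}
```

For the source `` ``var i = 1;`` `` (start 0, length 14), this yields start 2 and length 10, i.e. the substring `var i = 1;`.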


The strange thing is that I actually have managed to implement the source position mapping (in the sourcepos branch). For example, for input strings

\0\0   **foo**\r\n\t\0bar \n\n> *quoting **is nice***, `right`?
* animals\r\n* * *horses*\n* * dogs\n* persons\r* 1. john\n* 2. anna

All inline elements are parsed with correct position and length... Removal of \0, different newlines (single or double character), tabs, block indentation - all of these are handled.

Currently implemented tests

What is left to do

  • add tracking to remaining inline element types (currently only the ones in the sample above are done).
  • check which cases do additional manipulation with Block.StringContent that might not be tracked.
  • currently one test with multiple tabs is failing.

Question: would you need this information also for block elements or is the StartLine/EndLine enough for practical purposes?

The current web essentials parser has the following classification

  • Bold
  • Italic
  • Header
  • Quote
  • Code
    • Fenced
    • Inline
    • Indented

    public static class MarkdownClassificationTypes
    {
        public const string MarkdownBold = "md_bold";
        public const string MarkdownItalic = "md_italic";
        public const string MarkdownHeader = "md_header";
        public const string MarkdownCode = "md_code";
        public const string MarkdownQuote = "md_quote";

        [Export, Name(MarkdownClassificationTypes.MarkdownBold)]
        public static ClassificationTypeDefinition MarkdownClassificationBold { get; set; }

        [Export, Name(MarkdownClassificationTypes.MarkdownItalic)]
        public static ClassificationTypeDefinition MarkdownClassificationItalic { get; set; }

        [Export, Name(MarkdownClassificationTypes.MarkdownHeader)]
        public static ClassificationTypeDefinition MarkdownClassificationHeader { get; set; }

        [Export, Name(MarkdownClassificationTypes.MarkdownCode)]
        public static ClassificationTypeDefinition MarkdownClassificationCode { get; set; }

        [Export, Name(MarkdownClassificationTypes.MarkdownQuote)]
        public static ClassificationTypeDefinition MarkdownClassificationQuote { get; set; }
    }

At the moment only these types are highlighted in the text editor when editing a markdown file.
To have feature parity with the current implementation, I suppose handling only these cases would do.
But I think that implementing this in a more general-purpose way would give more value to CommonMark.NET.

OK, I believe the inline source position tracking is now implemented.

I would be very grateful if someone could get the sourcepos branch and play around a bit before I merge it into master.

Edit - found another case not yet covered. But anyway the progress is 99% :)

I will take a look at it tomorrow.
Thanks.

Sorry,
I don't think I will have time to try this out until this weekend (too much boring work stuff :)).

No problem :)

All of my own test cases currently pass. I also ended up implementing additional performance optimizations, so in the end the sourcepos branch is even faster than master - the position tracking (not counting the adjusting for tabs/indents) is just a few addition and assignment operations, so by itself it doesn't do much damage to performance.

As soon as you give the green light (just so that someone else has tried it), I will merge the changes and also push a new NuGet package.

I am a bit lazy and not sure if it will be useful, so for now I will not add SourcePosition and SourceLength to the Block class. I did, however, add them to the FencedCodeData class, since that is the specific feature required in Web Essentials.

Oh... And it also needs quotes... Well, I'll see what I can do... 😄

I have looked at the code and done some testing.
All seems very good.
Having SourcePosition and SourceLength in the Block class could be useful for visually matching block boundaries in the editor without knowing the code block type (fenced, inline, etc.), but I can try to implement it myself later and create a pull request if you don't have time (the same goes for quotes).

I can now try to implement the mapping between the AST and the model used by the Visual Studio editor and replace the custom parser implemented in Web Essentials :).
I hope to have something ready to show next week.

Thanks,
you have done an outstanding job (and very fast too!).

Commit cc0e1b3 now includes tracking of SourcePosition/SourceLength for all block elements as well. I removed the properties from the fenced code data object since the common properties should be used instead. The difference, though, is that the block properties will include the fences themselves - you would have to trim them yourself to get the boundaries of the contents.

By the way, the current result would allow you to enable syntax coloring for indented code blocks as well, if you allow the user to specify "default language" in the settings.

I also marked StartLine/StartColumn/EndLine as obsolete. I think that the new approach is much more flexible: the line number is usable only in very simple cases, and even word wrapping could make it unusable.
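If line numbers are still occasionally needed, they can be recovered from the offset on demand; a minimal sketch of that (my own illustration - it handles \n and \r\n line endings, but not lone \r):

```csharp
using System;

static class PositionMapper
{
    // Recovers a 1-based (line, column) pair from a source offset, so line
    // numbers remain available even though only offsets are tracked.
    public static (int Line, int Column) FromOffset(string source, int offset)
    {
        int line = 1, lineStart = 0;
        for (int i = 0; i < offset && i < source.Length; i++)
        {
            if (source[i] == '\n') { line++; lineStart = i + 1; }
        }
        return (line, offset - lineStart + 1);
    }
}
```

For example, in the source `"foo\nbar"` the offset 4 (the `b`) maps to line 2, column 1.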

Ok.
I have done some preliminary refactoring on the WebEssentials codebase to remove the current parser implementation and replace it with CommonMark.NET.
After I have something working I will update this issue or, if you want to close this one, open another to report any additional issues (if any).

Thanks