nvim-neorg/tree-sitter-norg

[Question] Feasibility of using this parser as a basis for a markdown tree-sitter parser?

BlackEdder opened this issue · 4 comments

I guess this is more an enquiry than an issue, but I was wondering how realistic it would be to use this parser as the base for a markdown parser. Norg markup seems to be closely related to markdown and markdown is currently missing a tree-sitter parser. A proper markdown parser would be very useful, especially if you have markdown documents with lots of code blocks (e.g. rmarkdown).

Hey! Realistically speaking, it shouldn't be too bad. Don't get me wrong, some refactoring would be necessary (especially in the scanner), but at a core level you should be able to parse all of the basic markdown constructs with just a few tweaks to the APIs we have. We're still working on the attached modifier branch, so things like *italic* and **bold** won't work just yet.

The only problem that you'll encounter is when trying to parse more niche parts of markdown without more than 1 char of lookahead. Markdown's "unseen" complexity yields complex parsers. Since edge cases can get seriously messy you'd have to write a large chunk of extra logic to get your desired result. Nothing's impossible though!

Thanks for the quick answer.

> The only problem that you'll encounter is when trying to parse more niche parts of markdown without more than 1 char of lookahead. Markdown's "unseen" complexity yields complex parsers. Since edge cases can get seriously messy you'd have to write a large chunk of extra logic to get your desired result. Nothing's impossible though!

I thought tree-sitter allows you to keep multiple "paths" open when looking ahead, and only picks the correct path once it becomes clear what was meant. Is that what you are referring to?

Yeah, that's what I was talking about. It's not particularly elegant though. It does allow you to have paths, but those "paths" are a bunch of if/switch statements and a cluster of `lexer->mark_end(lexer);` calls. It's not as great as it may seem on the surface (unless I've been writing parsers wrong my whole life 😅).

You'll have to really plan the structure of your scanner if you want the code to be easily extensible and maintainable :)
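
To make the "if/switch statements plus `mark_end` calls" pattern concrete, here is a minimal sketch of how a hand-written scanner might decide what a leading `*` means with only one character of lookahead. It is not taken from the tree-sitter-norg scanner. `Lexer`, `TokenType`, `scan_star` and the helper functions are hypothetical names, and `Lexer` is a cut-down stand-in for tree-sitter's `TSLexer` (a real scanner would include `tree_sitter/parser.h` instead). The rules are deliberately simplified: the sketch ignores line-start context and markdown's full emphasis rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for tree-sitter's TSLexer, reduced to the members
   this sketch needs, plus some harness state so it can run on a string. */
typedef struct Lexer Lexer;
struct Lexer {
    int32_t lookahead;                   /* the single char of lookahead */
    void (*advance)(Lexer *, bool skip); /* move past the lookahead char */
    void (*mark_end)(Lexer *);           /* fix the token end at this point */
    const char *src;                     /* harness only: input string */
    size_t pos;                          /* harness only: read position */
    size_t end;                          /* harness only: last marked end */
};

static void lexer_advance(Lexer *lx, bool skip) {
    (void)skip;
    if (lx->src[lx->pos] != '\0') lx->pos++;
    lx->lookahead = (unsigned char)lx->src[lx->pos];
}

static void lexer_mark_end(Lexer *lx) { lx->end = lx->pos; }

static Lexer lexer_new(const char *src) {
    assert(src != NULL);
    Lexer lx = { .lookahead = (unsigned char)src[0],
                 .advance = lexer_advance, .mark_end = lexer_mark_end,
                 .src = src, .pos = 0, .end = 0 };
    return lx;
}

/* Hypothetical token kinds for the illustration. */
typedef enum { NO_TOKEN, LIST_MARKER, EMPHASIS_OPEN, STRONG_OPEN, TEXT } TokenType;

/* Each "path" is a branch; mark_end pins the token end before peeking on. */
static TokenType scan_star(Lexer *lx) {
    if (lx->lookahead != '*') return NO_TOKEN;

    lx->advance(lx, false);
    lx->mark_end(lx);                 /* tentative token: "*" */

    switch (lx->lookahead) {
    case ' ':
    case '\t':
        return LIST_MARKER;           /* "* item" */
    case '*':
        lx->advance(lx, false);       /* peek past the second star */
        if (lx->lookahead != ' ' && lx->lookahead != '\n' &&
            lx->lookahead != '\0') {
            lx->mark_end(lx);         /* extend the token to "**" */
            return STRONG_OPEN;       /* "**bold" */
        }
        return TEXT;                  /* "** x": token stays "*" */
    case '\n':
        return TEXT;                  /* a lone star at end of line */
    case '\0':
        return TEXT;                  /* a lone star at end of input */
    default:
        return EMPHASIS_OPEN;         /* "*italic" */
    }
}
```

Even this toy version needs two `mark_end` calls and a nested check for a single character. Every additional markdown edge case adds another branch, which is why the structure of the scanner needs planning up front.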

Closing this because a proper markdown parser now exists.