syntax-tree/mdast

Add concrete syntax details

CupOfTea696 opened this issue · 5 comments

It would be nice to have concrete syntax information on certain nodes, for example, which bullet type was used for a list item.

Problem

When using a markdown parser to modify markdown and write it back to a file, it would be nice to re-use the same style as the original markdown content. Currently, there is no way to get this information to either use inside a compiler or to set the compiler options.

Expected behaviour

Syntax details would be included in tree nodes. Below is an example for emphasis.

Interface

interface Emphasis <: Parent {
  type: "emphasis"
  character: string?
  children: [TransparentContent]
}

Markdown:

*alpha* _bravo_

Yields:

{
  type: 'paragraph',
  children: [
    {
      type: 'emphasis',
      character: '*',
      children: [{type: 'text', value: 'alpha'}]
    },
    {type: 'text', value: ' '},
    {
      type: 'emphasis',
      character: '_',
      children: [{type: 'text', value: 'bravo'}]
    }
  ]
}

When recompiling the above tree back to Markdown, it would render back to *alpha* _bravo_ rather than *alpha* *bravo*, unless the compiler is explicitly set to use a certain character for emphasis.

Alternatives

This could be implemented without any compiler modifications by having a utility that detects the used syntax and sets the compiler's options accordingly.
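A rough sketch of what such a utility might look like, assuming the nodes carry unist-style `position.start.offset` values and the original source string is still available. The helper name `detectEmphasisMarker` and the trimmed-down `Node` type are made up for illustration. The `emphasis` option key on the last line is what I'd expect a serializer such as mdast-util-to-markdown to accept, but check that against its docs.

```typescript
// Minimal unist-like shapes; only the fields this sketch reads.
interface Point { offset?: number }
interface Node {
  type: string
  position?: { start: Point; end: Point }
  children?: Node[]
}

type Marker = '*' | '_'

// Hypothetical helper: tally the opening character of every emphasis node
// and return the most common one, or undefined if no emphasis was found.
function detectEmphasisMarker(tree: Node, source: string): Marker | undefined {
  const counts: Record<Marker, number> = { '*': 0, '_': 0 }

  const visit = (node: Node): void => {
    const offset = node.position?.start.offset
    if (node.type === 'emphasis' && offset !== undefined) {
      const char = source.charAt(offset)
      if (char === '*' || char === '_') counts[char] += 1
    }
    for (const child of node.children ?? []) visit(child)
  }

  visit(tree)
  if (counts['*'] === 0 && counts['_'] === 0) return undefined
  // Ties fall back to '*'.
  return counts['*'] >= counts['_'] ? '*' : '_'
}

// Hand-built tree for the source below; a real parser would produce this.
const source = '*alpha* _bravo_ _charlie_'
const tree: Node = {
  type: 'paragraph',
  position: { start: { offset: 0 }, end: { offset: 25 } },
  children: [
    { type: 'emphasis', position: { start: { offset: 0 }, end: { offset: 7 } } },
    { type: 'emphasis', position: { start: { offset: 8 }, end: { offset: 15 } } },
    { type: 'emphasis', position: { start: { offset: 16 }, end: { offset: 25 } } }
  ]
}

const detected = detectEmphasisMarker(tree, source)
// Assumed option shape for the compiler; adjust to whatever it really takes.
const compilerOptions = detected ? { emphasis: detected } : {}
```

This only picks one style for the whole document, so mixed input like the example above would still be normalised to the majority marker.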

I'm not sure this makes sense in mdast specifically.

This document defines a format for representing Markdown as an abstract syntax tree

https://github.com/syntax-tree/mdast#introduction

Abstract syntax trees, by design, encode structure, not syntax (which is what a concrete syntax tree would do).

micromark and the Common Markup State Machine (CMSM) could enable constructing a concrete syntax tree, and this is noted in the CMSM readme:

complete, as it defines different types of tokens and how they are grouped, which allows the format to be represented as a concrete syntax tree

https://github.com/micromark/common-markup-state-machine/blob/0befbfa556fdba5559d35f8f365c2d50be301a1f/readme.md#1-background

This would likely need a new standard (mdcst?) to capture concrete syntax needs, since transforms interested in structure (AST) and formatters interested in specific syntax (CST) will have different wants and needs.

Also see previous discussion at syntax-tree/mdast-util-to-markdown#3

@CupOfTea696 If this is needed, you can also use the positional info to access that info by looking characters up in the corresponding vfile!
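For illustration, a small sketch of that lookup, assuming the node has a unist-style `position` with a `start.offset` and that the source is available as a string (with a vfile, `String(file)` should give you that). The helper name `openingCharacter` is hypothetical and not part of any package.

```typescript
// Minimal unist-like shapes; only the fields this sketch reads.
interface Point { line: number; column: number; offset?: number }
interface Position { start: Point; end: Point }
interface Node { type: string; position?: Position }

// Hypothetical helper: return the first source character of a node,
// or undefined when the node has no usable positional info.
function openingCharacter(node: Node, source: string): string | undefined {
  const offset = node.position?.start.offset
  if (offset === undefined) return undefined
  const char = source.charAt(offset)
  return char === '' ? undefined : char
}

const source = '*alpha* _bravo_'

// Hand-written node for the second emphasis; a parser would normally supply it.
const bravo: Node = {
  type: 'emphasis',
  position: {
    start: { line: 1, column: 9, offset: 8 },
    end: { line: 1, column: 16, offset: 15 }
  }
}

const marker = openingCharacter(bravo, source)
```

Nodes generated by plugins usually have no `position`, which is why the helper returns `undefined` instead of guessing.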

Some of this was also discussed in remarkjs/remark#32 and then in remarkjs/remark#132 (comment), back when remark had just been made (and was still called mdast).

Honestly, I feel that PostCSS and ESTree, which do patch this stuff on nodes, made a mistake: it makes the syntax tree hard to handle.

Some more past issues on all this: https://github.com/search?o=desc&q=CST+user%3Amicromark+user%3Aremarkjs+user%3Aunifiedjs+user%3Asyntax-tree&s=created&type=Issues

I’m closing this because I don’t think such fields should be added to mdast nodes (by default, that is: of course, it’s just JSON, so you can do that yourself if you want).
If/when there is a CST version of mdast, it will be a different project, and I’ll make sure to note it here!