syntax-tree/mdast

Is the `uri` field RFC3986 compliant?

koraa opened this issue · 2 comments

koraa commented

Hi; thanks for the great library :)

Now to the point:

Parsing the following markdown

# Foo

Hello World [link](<foo
bar>)

I get the following MDAST (represented as YAML, position info stripped):

type: root
children:
- type: heading
  depth: 1
  children:
  - type: text
    value: Foo
- type: paragraph
  children:
  - type: text
    value: "Hello World "
  - type: link
    title: ~
    url: "foo\nbar"
    children:
    - type: text
      value: link

The AST looks pretty much as expected. The newline in the link is handled by including a newline character (0x0A) in the string, which also seems alright. It caused some problems for us when using the AST, though, because we expected URLs to be RFC 3986 compliant, and RFC 3986 requires that characters outside its allowed set be percent-encoded.

Is this expected behavior?

We specifically ran into this issue when using JSON Schema to validate our mdast. Do you have any recommendation on the best way to validate whether an mdast tree is compliant?
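For readers who hit the same problem, here is a minimal sketch of the kind of check described above, written in TypeScript. It only tests that every character in a `url` field belongs to the RFC 3986 character set (or is part of a percent-encoded triplet); it does not validate the full URI grammar. The `MdastNode` type and the function name `findNonCompliantUrls` are assumptions made for this example and are not part of mdast or any of its utilities.

```typescript
// Minimal node shape assumed for this sketch; real mdast nodes carry more fields.
interface MdastNode {
  type: string;
  url?: string;
  value?: string;
  children?: MdastNode[];
}

// Characters RFC 3986 permits in a URI: unreserved and reserved characters,
// plus "%" only when it starts a percent-encoded triplet such as "%0A".
const RFC3986_CHARS = /^(?:[A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*$/;

// Walk the tree and collect every `url` containing a character outside that set.
function findNonCompliantUrls(node: MdastNode, found: string[] = []): string[] {
  if (typeof node.url === "string" && !RFC3986_CHARS.test(node.url)) {
    found.push(node.url);
  }
  for (const child of node.children || []) {
    findNonCompliantUrls(child, found);
  }
  return found;
}

// The paragraph from the tree above, reduced to the relevant fields.
const issueTree: MdastNode = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [
        { type: "text", value: "Hello World " },
        {
          type: "link",
          url: "foo\nbar",
          children: [{ type: "text", value: "link" }],
        },
      ],
    },
  ],
};

const flaggedUrls = findNonCompliantUrls(issueTree); // → ["foo\nbar"]
```

A check like this can run next to schema validation, so the schema itself only needs to require that `url` is a string.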

The markdown you posted is valid by default, but not if CommonMark is turned on, as you can see rendered here on GitHub:

Hello World [link](<foo
bar>)

Hello World [link]()

...because in CommonMark, whitespace cannot appear in this construct (a link destination enclosed in angle brackets).
I suggest using CommonMark and avoiding whitespace in links.

We do not change URLs in mdast, so that we can also create markdown again. This is expected behaviour, so I suggest using a laxer JSON schema.

However, if we’re going to a format like HTML, I do think we should encode the URLs. If you’re doing something like that and it doesn’t work, please let us know.
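As a rough illustration of that split (raw URLs in the tree, encoding only on the way out), here is a TypeScript sketch that produces an encoded copy of a tree before serializing to something like HTML. It is not the actual implementation used by any mdast or remark package; the `UrlNode` type and the helpers `encodeUrl` and `withEncodedUrls` are hypothetical names for this example. It relies on the built-in `encodeURI`, which throws a `URIError` on lone surrogates, so real input may need extra handling.

```typescript
// Minimal node shape assumed for this sketch; real mdast nodes carry more fields.
interface UrlNode {
  type: string;
  url?: string;
  value?: string;
  children?: UrlNode[];
}

// Percent-encode characters that may not appear raw in a URI.
// encodeURI also turns "%" into "%25", so triplets that were already
// valid (such as "%20") are restored afterwards to avoid double encoding.
function encodeUrl(url: string): string {
  return encodeURI(url).replace(/%25([0-9A-Fa-f]{2})/g, "%$1");
}

// Return a copy of the tree with every `url` encoded.
// The original tree is left untouched so markdown can still be regenerated from it.
function withEncodedUrls(node: UrlNode): UrlNode {
  const copy: UrlNode = { ...node };
  if (typeof node.url === "string") {
    copy.url = encodeUrl(node.url);
  }
  if (node.children) {
    copy.children = node.children.map((child) => withEncodedUrls(child));
  }
  return copy;
}

const encodedExample = encodeUrl("foo\nbar"); // → "foo%0Abar"
```

Working on a copy keeps the mdast tree faithful to the source document, which matches the reasoning above for not changing URLs in mdast itself.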

koraa commented

Ok! Thanks for the clarification!