jgm/djot.js

Multiline references

Closed this issue · 11 comments

Hello,

I was tempted to open an issue on the djot language repository, because I'm really interested in clarifying the specification, but right now the playground is in a somewhat inconsistent state. I'd rather have a clear spec and align every parser with it, but since this is the reference implementation, I would understand if you wanted to put it in order before enshrining its behavior as the djot spec.

Please consider the following djot source:

# Heading
continued

- [test link][Heading
continued]
- [test link][Heading continued]
- foot[^ref bar]

[ref2
foo]: https://example.com/

[^ref
foo]: some note

Currently the playground rejects the link reference (because of the newline), but the heading creates a link reference that embeds a newline, so neither text link matches the heading (and as far as I can tell the heading can never be referenced).

However, the footnote accepts a multiline reference, but it seems to match only the first part, so here [^ref bar] refers to [^ref\nfoo]:, which I don't think is intended.

jgm commented

We should presumably do some kind of normalization on whitespace in references to avoid this problem.
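A minimal sketch of what such normalization could look like, assuming that runs of ASCII whitespace (including newlines) in a label collapse to a single space before labels are stored or looked up. The names `normalizeLabel`, `definitions`, and `lookup` are hypothetical and are not part of djot.js.

```javascript
// Hypothetical helper (not djot.js API): collapse whitespace runs in a
// reference label so that "Heading\ncontinued" and "Heading continued"
// compare equal.
function normalizeLabel(label) {
  return label
    .replace(/^[ \t\r\n]+|[ \t\r\n]+$/g, "") // strip leading/trailing whitespace
    .replace(/[ \t\r\n]+/g, " ");            // collapse inner runs to one space
}

// Assumed storage: a map from normalized label to destination.
const definitions = new Map();
definitions.set(normalizeLabel("Heading\ncontinued"), "#Heading-continued");

// Look a label up using the same normalization as for definitions.
function lookup(label) {
  return definitions.get(normalizeLabel(label));
}

console.log(lookup("Heading continued"));  // single-line form of the label
console.log(lookup("Heading\ncontinued")); // multiline form of the label
```

With this approach both list items in the example above would resolve to the same definition, because the comparison happens on the normalized form only.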

The issue with the footnote is a separate problem; that definitely shouldn't happen.

jgm commented

By the way, I don't think [^ref bar] is matching [^ref foo]. The note is stored in the AST as ref\nfoo, and the note reference is to ref bar. Rendering to HTML, we get a note with empty content, not a note with the content "some note" of ref\nfoo.

By the way, I don't think [^ref bar] is matching [^ref foo]. The note is stored in the AST as ref\nfoo, and the note reference is to ref bar. Rendering to HTML, we get a note with empty content, not a note with the content "some note" of ref\nfoo.

Indeed, you're right: I misread the HTML in the playground (I mistakenly thought the empty footnote generated for the unmatched reference was caused by the newline-embedding definition).

On a separate but somewhat-related point, you might want to look at getUniqueIdentifier: if I understand /[\W\s.]+/ correctly, the \W already includes \s and ., so the pattern is equivalent to /[\W]+/. Moreover, this doesn't match the behavior of djot.lua: here we only keep ASCII letters and digits, while in Lua only the ASCII punctuation is removed, keeping not only ASCII letters and digits, but also any non-ASCII characters.
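The equivalence claimed above can be checked with a small sketch. It assumes, for illustration only, that each matched run is replaced by "-"; the helper `slug` and the sample strings are hypothetical and are not taken from djot.js.

```javascript
// The pattern quoted from getUniqueIdentifier and its simplified form.
const original = /[\W\s.]+/g;
const simplified = /\W+/g;

// Hypothetical helper: replace each matched run with a single "-".
function slug(text, pattern) {
  return text.replace(pattern, "-");
}

const samples = ["Multiline references", "a.b  c", "Héllo wörld"];

for (const s of samples) {
  // Both patterns give the same result, since \W already covers
  // whitespace and "." (neither is a word character).
  console.log(JSON.stringify(s), slug(s, original), slug(s, simplified));
}
```

The last sample shows the ASCII-only behavior: without the `u` flag, `\w` covers only ASCII letters, digits, and `_`, so "é" and "ö" are treated as separators and "Héllo wörld" becomes "H-llo-w-rld".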

I also wonder whether this should be lifted into the spec: on the one hand the spec could be left abstract, with ids being parser-dependent as long as they match within a given document, but on the other hand stealing your test/*.test files forces me to match unspecified behavior as well. Would there be value in trying to build a parser-independent test suite?

jgm commented

here we only keep ASCII letters and digits

I guess \w only matches ASCII letters and digits and _! -- I actually hadn't realized that, so this isn't intentional. What about switching to:

/[^\w\p{L}\p{N}\p{Z}]/

jgm commented

Parser-independent test suite: I'm open to suggestions here. (And I'm open to putting the identifier scheme in the spec, I'm not sure about that though.) Identifiers are not the only aspect of the HTML output that are left unspecified, of course -- there are things like indentation, whether to use closing tags when they are optional, and so on. One approach is to create a writer for your parser that targets the spec expectations, even if it's not the standard one you want to use.

jgm commented

Another approach would be to use an abstract representation of the ast, such as we now create with the -t ast or -t astpretty options, for the tests. But of course this also requires that you create a special writer just for testing conformity.

By the way, what sort of parser are you working on?

What about switching to: /[^\w\p{L}\p{N}\p{Z}]/

It would sadden me a bit to give up on design goal 6 (by requiring Unicode character classes). I don't really understand the design trade-offs here, so I can't really judge whether the benefits would be worth such a cost.

Parser-independent test suite: I'm open to suggestions here.

I asked before thinking it through so that I wouldn't have wasted time if it had no chance of getting anywhere. I guessed maybe there would exist somewhere an HTML equivalency checker (as a tool to test minifiers). I'll let you know if I find something useful.

By the way, what sort of parser are you working on?

It's my first real Erlang program, written to learn the language along the way (and hopefully to find someone willing to review it until it reaches a decent level of quality). I stumbled on djot while I was considering how to extend my own Markdown dialect to output Gemini besides my usual HTML, and found djot to be both simpler and more extensible than anything I could have devised while trying to extend Markdown.

It's a character-by-character online parser (it fits my way of thinking and the pattern-matching paradigm of Erlang well), bastardized with some limited look-ahead and some checkpoint restoration to emulate backtracking. It builds the AST along the way using a stack of currently opened elements.

So far I have managed to match the HTML (```) and the AST (``` a) tests, with 98 cases currently passing. I don't plan to make the filters or the ``` ap or ``` m tests work. Most of the features work; I'm still missing a few block and inline elements.

But maybe what I did was only the 80% that takes 20% of the time. Right now I'm facing the issue of correctly parsing *[foo](bar*, because after seeing (bar the parser expects some raw text and not the delimiter of an inline element (to be fair, after the precedent of djot#109 a case could be made for *[foo][bar*baz]* being an emphasized link, but I would rather have ruled them both the other way).

jgm commented

It would sadden me a bit to give up on design goal 6 (by requiring Unicode character classes). I don't really understand the design trade-offs here, so I can't really judge whether the benefits would be worth such a cost.

I forgot all about that! Well, then, we can just construct a regex that excludes the ASCII space and punctuation characters.

Well, then, we can just construct a regex that excludes the ASCII space and punctuation characters.

If it can help, djot.lua already has gsub("[][~!@#$%^&*(){}`,.<>\\|=+/?]","") (so punctuation is dropped rather than gathered with blanks into -), and in my parser there is something functionally equivalent to [!-/:-@[-`{-~] for ASCII punctuation (e.g. for deciding whether \ escapes the following character or not).
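As a sketch of how the ASCII-only approach could look in JavaScript, the following uses the [!-/:-@[-`{-~] class quoted above. It is not the change that was made in djot.js: `isAsciiPunctuation` and `toIdentifier` are hypothetical names, and the choice to drop punctuation and join the remaining words with "-" is an assumption modeled on the djot.lua behavior described above. Note that this class is broader than the djot.lua list (it also removes characters such as "-" and "_").

```javascript
// ASCII punctuation as four ranges: ! to /, : to @, [ to `, { to ~.
const PUNCT_CLASS = "[!-/:-@[-`{-~]";
const punctGlobal = new RegExp(PUNCT_CLASS, "g");
const punctSingle = new RegExp("^" + PUNCT_CLASS + "$");

// Hypothetical helper: true if ch is a single ASCII punctuation character.
function isAsciiPunctuation(ch) {
  return punctSingle.test(ch);
}

// Hypothetical identifier scheme (an assumption, not the djot.js fix):
// drop ASCII punctuation, then turn runs of ASCII whitespace into "-".
// Non-ASCII characters are kept, and no Unicode classes are needed.
function toIdentifier(text) {
  return text
    .replace(punctGlobal, "")
    .replace(/^[ \t\r\n]+|[ \t\r\n]+$/g, "")
    .replace(/[ \t\r\n]+/g, "-");
}

console.log(toIdentifier("Héllo, wörld!"));      // non-ASCII letters survive
console.log(toIdentifier("Heading\ncontinued")); // newline treated as whitespace
```

Because only ASCII ranges are involved, a parser in any language can implement the same rule with plain byte comparisons.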

jgm commented

I've made a change that should fix the problem. Is there anything else remaining in this issue or can it be closed?

That covers everything I've found so far, so this issue can be closed.

There might still be the question of whether the specification should standardize the HTML ids (I'm inclined to say no), but that is beyond the scope of djot.js and of this issue anyway.