jgm/djot

Is a backslashed space still whitespace?

faelys opened this issue · 5 comments

Hello,

sorry to bother you again. As might be obvious by now, I'm implementing a new djot parser and trying to match existing behavior. Here is something that surprises me as a user (and is somewhat difficult to fit into my parser architecture, but that's my problem):

This _is an escaped space:\ _ and there is no emphasis.

This _is an actual non-breaking space (U+00A0): _ and there is an emphasis.

As a user, I would have expected \ and U+00A0 to be interchangeable, and not be considered as whitespace as far as syntax goes.

Am I in a minority here? Is it worth a specification update?

Interestingly, currently in the online playground, attributes on the escaped non-breaking space change the behavior:

This _is an escaped space without attributes:\ _ and there is no emphasis.

This _is an escaped space with attributes:\ {.fixed}_ and there is now an emphasis.

Might this be an inconsistency in the current parser, or a specification issue about how the inlines nest?

Having looked closely at the current parser, I understand the current behavior to be that emphasis and similar marks look for preceding or subsequent whitespace in the raw source text, not in the AST or any semantic representation. So adding attributes here makes the character before _ a closing brace, which is not whitespace.
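
To make that reading concrete, here is a minimal sketch in Python of a raw-source check. It is not the reference parser's code: the names `ASCII_WHITESPACE`, `can_open`, and `can_close` are made up for illustration, the exact whitespace set is an assumption, and the separate requirement that something other than the delimiter sits between opener and closer is left out.

```python
# Assumed set of ASCII whitespace; U+00A0 is deliberately not in it.
ASCII_WHITESPACE = " \t\r\n"


def can_open(src: str, i: int) -> bool:
    """True if the delimiter at src[i] is not directly followed by raw whitespace."""
    # Assumption: a delimiter at the very end of the input cannot open.
    return i + 1 < len(src) and src[i + 1] not in ASCII_WHITESPACE


def can_close(src: str, i: int) -> bool:
    """True if the delimiter at src[i] is not directly preceded by raw whitespace."""
    # Assumption: a delimiter at the very start of the input cannot close.
    return i > 0 and src[i - 1] not in ASCII_WHITESPACE


samples = {
    "escaped space": "_is an escaped space:\\ _",
    "U+00A0": "_is a non-breaking space:\u00a0_",
    "escaped space + attributes": "_is an escaped space:\\ {.fixed}_",
}

for name, src in samples.items():
    closer = len(src) - 1  # index of the final "_"
    print(f"{name}: can_close = {can_close(src, closer)}")
```

Under these assumptions the final `_` cannot close in the first sample (the raw character before it is a space, escaped or not) but can close in the other two, where the preceding raw character is U+00A0 or `}`. Likewise, `can_open("_\\ x", 0)` is true, because the raw character after the `_` is a backslash.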

I guess specifying a rule about raw source whitespace is as legitimate as a rule about semantic whitespace, but I think even as a basic user I would like to be informed of which one it is (just like I think it was useful to spell out that only ASCII whitespace counts, not the whole Unicode class).

jgm commented

look for preceding or subsequent whitespace in the raw source text

Correct.

Do you want to make a targeted suggestion about where this should be reflected in the documentation?

Do you want to make a targeted suggestion about where this should be reflected in the documentation?

My specification-reading skill is a bit weird, so you might want other opinions, but as a user I think I would be satisfied with the following additions:

A _ or * can open emphasis only if it is not directly followed by whitespace in the source text. It can close emphasis only if it is not directly preceded by whitespace in the source text, and only if there are some characters besides the delimiter character between the opener and the closer.

The emphases mark the additions; I don't think any emphasis would be needed in the documentation itself. However, these would be the first occurrences of the words "source text", and I haven't found any established vocabulary to distinguish between source text, semantic interpretation, and "formatted output".

As a parser-writer I would also welcome an update to the example box below that paragraph, showing that _\ can open emphasis and \ _ cannot close it, but I don't know at what point that makes too many examples.

bpj commented

As long as there is no standard way to insert characters by reference (e.g. a symbol looking like a Unicode codepoint in U+XXXX format) this is not good. A \ + U+0020 should probably be equivalent to a U+00A0 everywhere (except inside attributes).
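
For comparison, here is a rough Python sketch of what the alternative could look like for a closing delimiter: an escaped space is treated like U+00A0, that is, as non-whitespace. This is purely hypothetical and is not what the current parser does. The names `is_escaped` and `can_close_semantic` are invented for the illustration, and attributes and other inline constructs are ignored.

```python
# Assumed set of ASCII whitespace, as in the raw-source rule.
ASCII_WHITESPACE = " \t\r\n"


def is_escaped(src: str, i: int) -> bool:
    """True if src[i] is preceded by an odd number of backslashes."""
    count = 0
    j = i - 1
    while j >= 0 and src[j] == "\\":
        count += 1
        j -= 1
    return count % 2 == 1


def can_close_semantic(src: str, i: int) -> bool:
    """Hypothetical closer rule where backslash + space counts like U+00A0."""
    if i == 0:
        return False
    prev = src[i - 1]
    if prev == " " and is_escaped(src, i - 1):
        return True  # escaped space: treated as a non-breaking space
    return prev not in ASCII_WHITESPACE


for src in ["_a\\ _", "_a _", "_a\\\\ _", "_a\u00a0_"]:
    print(repr(src), can_close_semantic(src, len(src) - 1))
```

With this variant the closer after an escaped space and the closer after U+00A0 behave the same way, while a plain space, or a space preceded by an escaped backslash, still blocks the closer.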