jgm/djot

Allow any valid HTML4 identifier string to be a djot identifier string


Thanks for your work; I am excited about using this project.

I'm converting some markdown files to djot with pandoc and am hitting an unfortunate behavior. I'm uncertain if it's a bug in pandoc's djot writer, a needed change in djot, or neither; would be willing to contribute in either codebase if it is one.

echo "# R." | pandoc -f markdown -t djot

produces


{#r.}
# R.

which parses to

doc
  para
    str text="{#r.}"
    soft_break
    str text="# R."

The same text without the period at the end compiles to the desired

doc
  section id="r"
    heading level=1
      str text="R"

Of course djot can set whatever rules it wants on what belongs in an ID, which implies the pandoc writer should not be writing a djot-invalid identifier; but unless I'm missing something the simpler solution would seem to be allowing any valid SGML and HTML4 identifier to be a valid djot identifier, where "ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".")."
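For illustration, the HTML4 rule quoted above can be written as a regular expression. This is only a sketch of that one quoted sentence, not of djot or pandoc behavior; `HTML4_ID` and `is_html4_id` are hypothetical names.

```python
import re

# The quoted HTML4 ID/NAME rule: one ASCII letter, then any number of
# letters, digits, hyphens, underscores, colons, and periods.
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9_:.-]*")


def is_html4_id(s: str) -> bool:
    """Return True if s matches the quoted HTML4 ID/NAME rule."""
    return HTML4_ID.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["r", "r.", "foo.bar", "1", ":"]:
        print(repr(candidate), is_html4_id(candidate))
```

Under this rule both `r.` and `foo.bar` are accepted, while `1` and `:` are rejected because they do not start with a letter.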

jgm commented

Currently the syntax for attributes (undocumented except in code comments) is

 attributes <- '{' whitespace* attribute (whitespace attribute)* whitespace* '}'
 attribute <- identifier | class | keyval
 identifier <- '#' name
 class <- '.' name
 name <- (nonspace, nonpunctuation other than ':', '_', '-')+
 keyval <- key '=' val
 key <- (ASCII_ALPHANUM | ':' | '_' | '-')+
 val <- bareval | quotedval
 bareval <- (ASCII_ALPHANUM | ':' | '_' | '-')+
 quotedval <- '"' ([^"] | '\"')* '"'

So we don't allow . in an identifier. I can't recall whether there was a specific reason for this.
XML identifiers are more restrictive than this (must start with letter or underscore). HTML4 identifiers are less restrictive, and HTML5 identifiers are much less restrictive.
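As a rough illustration of the `name` rule in the grammar above, here is a minimal Python sketch. It assumes that "punctuation" means ASCII punctuation (Python's `string.punctuation`), which may not match the real implementation exactly; the function names are hypothetical and this is not djot's actual code.

```python
import string

# Punctuation characters the grammar explicitly allows inside a name.
ALLOWED_PUNCT = set(":_-")


def is_djot_name_char(ch: str) -> bool:
    """One character of `name`: not whitespace, and not punctuation
    unless it is ':', '_' or '-'. ASCII punctuation is an assumption."""
    if ch.isspace():
        return False
    if ch in string.punctuation and ch not in ALLOWED_PUNCT:
        return False
    return True


def is_djot_name(s: str) -> bool:
    """`name` is one or more allowed characters."""
    return len(s) > 0 and all(is_djot_name_char(ch) for ch in s)


if __name__ == "__main__":
    for candidate in ["r", "r.", "foo.bar", "1", ":", "é"]:
        print(repr(candidate), is_djot_name(candidate))
```

With these assumptions `r`, `1`, `:` and `é` are accepted as names, while `r.` and `foo.bar` are rejected because of the period.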

Class names have more restrictions (at least if they're to be used with CSS).

EDIT: Anyway, I'm open to making this less restrictive, but some thought needs to go into what would be a reasonable restriction.

At first glance, it seems that djot has a principle of not distinguishing between the first character and later characters in ids, possibly for simplicity of implementation? That would dictate that . can't appear in ids, because '.' name indicates a class. Or possibly it's just that classes and ids follow the same pattern, and class names in djot may not contain periods (which I agree is a good decision).

> HTML4 identifiers are less restrictive

As I understand it, HTML4 ids are generally extremely restrictive, because they follow the SGML rules laid out in ISO 8879:1986. For example, #1 and #: are invalid HTML4 identifiers or class names, because they don't start with [A-Za-z], but they are valid djot identifiers.

The only case I see where djot is more restrictive than HTML4 is that "foo.bar" is a valid HTML4 identifier but an invalid djot identifier, because it contains a period. This difference prevents a lot of pretty basic ASCII-encoded HTML4 from being able to round-trip through djot back to HTML.

I have one firm proposal, which is to disentangle the identifier and class rules to allow non-initial identifier characters to be periods. I.e.:

 identifier <- '#' nameChar Maybe[subsequentIdChar+]
 class <- '.' nameChar+
 nameChar <- (nonspace, nonpunctuation other than ':', '_', '-')
 subsequentIdChar <- (nonspace, nonpunctuation other than ':', '_', '-', '.')
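A minimal Python sketch of how the proposed rules might behave, assuming that "punctuation" means ASCII punctuation and that tokens have already been split out of the attribute braces. The helper names are hypothetical, and this is not a patch against any djot implementation.

```python
import string


def is_name_char(ch: str, extra: str = "") -> bool:
    """nameChar: not whitespace, not ASCII punctuation except ':', '_', '-'.
    `extra` lists additional punctuation to allow (assumption: ASCII only)."""
    allowed = set(":_-") | set(extra)
    return not ch.isspace() and (ch not in string.punctuation or ch in allowed)


def is_proposed_identifier(token: str) -> bool:
    """'#' nameChar, then zero or more characters that may also be '.'."""
    if len(token) < 2 or token[0] != "#":
        return False
    first, rest = token[1], token[2:]
    return is_name_char(first) and all(is_name_char(c, extra=".") for c in rest)


def is_class(token: str) -> bool:
    """'.' followed by one or more nameChar; periods stay disallowed."""
    if len(token) < 2 or token[0] != ".":
        return False
    return all(is_name_char(c) for c in token[1:])


if __name__ == "__main__":
    for token in ["#r.", "#foo.bar", "#.foo", ".foo", ".foo.bar"]:
        print(repr(token), is_proposed_identifier(token), is_class(token))
```

In this sketch `#r.` and `#foo.bar` become valid identifiers, `#.foo` is still rejected because the first character after `#` may not be a period, and class names are unchanged.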

My goals would be served equally well by the parser accepting periods on ids in any position but requiring them to be escaped (\.). But that feels uglier.

I don't have opinions about any larger related changes, though I do like how Unicode characters can be used in id and class names in djot.


Just for context, I should probably say that my interest here is not primarily in writing djot, but in getting things into djot's AST, which is much nicer to work with than pandoc's for my purposes.