jgm/djot

Proposal: Punctuation Character for Code Block's language specifier

RonaldZielaznicki opened this issue · 6 comments

Proposal

Utilize a punctuation character for fenced code block language specifier.

I use ? below, but would be happy for any character.

Why?

The current syntax introduces ambiguity and breaks djot's design goals, namely goals 7 and 3 (in spirit).

In addition to resolving these breaks from the goals, it'd allow easier parsing. As is, it's not until we get to the second word of a line that a parser would know the line is the start of a paragraph rather than a code fence.

Goals 3 and 7 - ambiguity and friendliness to hard wrapping

Paragraph

Here we have a paragraph that I'd like to introduce a hard break to. Live Demo

``` I'm a paragraph
```
<p><code>I'm a paragraph</code></p>

Code Block

Add a hard break at the wrong spot, and we no longer have a paragraph. Live Demo

``` I'm
a paragraph
```
<pre><code class="language-I'm">a paragraph</code></pre>
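
To make the ambiguity concrete, here is a minimal sketch of how an opening line gets classified under the current rule, assuming the rule is "three or more backticks, optionally followed by one language word, and nothing else". The names `FENCE_OPEN`, `classify_opening_line` and `language_of` are made up for illustration and are not taken from djot.js.

```python
import re
from typing import Optional

# Assumption: a fence line is 3+ backticks, then at most one word without
# whitespace or backticks (the language specifier), then end of line.
FENCE_OPEN = re.compile(r"^(`{3,})[ \t]*([^\s`]*)[ \t]*$")


def classify_opening_line(line: str) -> str:
    """Return 'code-fence' or 'paragraph' for the first line of a block."""
    return "code-fence" if FENCE_OPEN.match(line) else "paragraph"


def language_of(line: str) -> Optional[str]:
    """Return the language specifier of a fence line, or None."""
    m = FENCE_OPEN.match(line)
    if m is None:
        return None
    return m.group(2) or None


if __name__ == "__main__":
    for line in ["``` I'm a paragraph", "``` I'm", "```"]:
        print(repr(line), classify_opening_line(line), language_of(line))
```

The classifier has to read past the first word before it can decide, which is exactly why moving the line break changes the result.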

What it'd look like

Using a specifier character, we know from the beginning of the line whether this is a fenced code block or a paragraph.

With Specifier Character

```? I'm not a paragraph
```
<pre><code>I'm not a paragraph</code></pre>

Without Specifier Character

``` I'm
a paragraph```
<p><code>I'm a paragraph</code></p>

With Language

```?djot I'm not a paragraph
```
<pre><code class="language-djot">I'm not a paragraph</code></pre>

Optional Spacing

``` ?djot I'm not a paragraph
```
<pre><code class="language-djot">I'm not a paragraph</code></pre>

Raw Fencing

```
I'm not a paragraph
```
<pre><code>I'm not a paragraph</code></pre>
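
The examples above could be handled by something like the following sketch. It is only my reading of the examples, not a spec: I'm assuming that `?` marks a code block, that any non-space characters glued to `?` are the language, that the rest of the line is the first line of code, and that backticks with nothing after them are a raw fence. `Opening`, `PROPOSED` and `classify` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Opening(NamedTuple):
    kind: str                # "code-fence" or "paragraph"
    language: Optional[str]  # None when no language was given
    first_line: str          # code text following the specifier, if any


# Assumption: 3+ backticks, optional spaces, optional "?" plus language,
# optional spaces, then whatever is left on the line.
PROPOSED = re.compile(r"^`{3,}[ \t]*(?:\?(\S*))?[ \t]*(.*)$")


def classify(line: str) -> Opening:
    m = PROPOSED.match(line)
    if m is None:
        return Opening("paragraph", None, "")
    specifier, rest = m.group(1), m.group(2)
    if specifier is not None:
        # "?" was present, so this is a code block no matter what follows.
        return Opening("code-fence", specifier or None, rest)
    if rest == "":
        # Backticks alone on the line: raw fence without a language.
        return Opening("code-fence", None, "")
    # Backticks followed by text and no "?": inline verbatim in a paragraph.
    return Opening("paragraph", None, "")
```

With this reading the decision is made as soon as the character after the backticks (and optional spaces) is seen, so a hard break later in the line can't change it.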

Additional Benefits

Multiple words are allowed in a code fence, as suggested in #214.

Additional Thoughts

The character doesn't need to be ?. It's just a character I thought would be easy to utilize in this context. = would have been my first choice, but that's used for raw html blocks.

Alternatives

Use tilde instead of backtick

Having punctuation characters pull double duty is what creates the ambiguity. Backtick is used for verbatim and code blocks. Tilde is used for subscript. But, unlike in verbatim, a series of ~ doesn't do anything special for subscripts. I'd lean heavier into this alternative myself, but backticks for code fences are fairly well understood by users, and they'd be the ones who'd have to adapt.

Block Attributes

I'd love to use a block attribute to set the language specifier, but that resolves neither the ambiguity nor the unfriendliness towards hard breaks.

Related or Similar Issues

#41
#214

jgm commented

Actually, the current djot.js parser does allow ~~~ for code blocks, even though this isn't mentioned in the syntax description. So requiring that is a tempting solution to the ambiguity problem.

On the other hand, using ``` has a pleasing conceptual simplicity; it's not too different from """ for multiline strings and " for inline strings in some languages.

Curious to hear other comments on this.

Actually, the current djot.js parser does allow ~~~ for code blocks, even though this isn't mentioned in the syntax description. So requiring that is a tempting solution to the ambiguity problem.

Yup! It's why I pushed it as an alternative. Glad we're aligned there.

On the other hand, using ``` has a pleasing conceptual simplicity; it's not too different from """ for multiline strings and " for inline strings in some languages.

It is pretty intuitive at this point to reach for ```, isn't it? I didn't even know about ~~~ as a fence until I glanced at the js implementation to see how it handled the language specifier. Then I tried it with GitHub's markdown preview.

Curious to hear other comments on this.

Same. The other issues linked above are already filled with plenty of insights, but I don't think any of them touched on this specific issue/proposal.

Ah, one more comment on:

Actually, the current djot.js parser does allow ~~~ for code blocks, even though this isn't mentioned in the syntax description. So requiring that is a tempting solution to the ambiguity problem.

Requiring ~~~ as the code block fence doesn't completely solve the ambiguity issue unless a blank line followed by ~~~ always becomes a code block. But even then, having a specifier character would help because of the ambiguity caused by hard breaks and whether the first word is a language specifier or not.

Without Hard Break

~~~ I'm a paragraph
~~~

leads to

<p>~~~ I’m a paragraph~~~</p>

With Hard Break

While

~~~ I'm
a paragraph
~~~

becomes

<pre><code class="language-I'm">a paragraph
</code></pre>

The rule being:

A code block starts with a line of three or more consecutive backticks, optionally followed by a language specifier, but nothing else.

Then

``` Some things
```

Should perhaps

  • Be a syntax error
  • Or a language specifier "Some things"

(Depending on whether or not restrictions are imposed on language specifiers)

@Omikhleia As is, calling it a syntax error might be difficult. The syntax description defines verbatim text as:

Verbatim content begins with a string of consecutive backtick characters (`) and ends with an equal-lengthed string of consecutive backtick characters.

Material between the backticks is treated as verbatim text (backslash escapes don’t work there).

If the content starts or ends with a backtick character, a single space is removed between the opening or closing backticks and the content.

If the text to be parsed as inline ends before a closing backtick string is encountered, the verbatim text extends to the end.

This is verbatim text:

`Verbatim text`

but so is this:

```Verbatim text```
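
A rough sketch of the quoted verbatim rule, to show why an unclosed run of backticks is still valid input rather than an error. This is a simplified single-string version written for this comment, not the djot.js code, and `parse_verbatim` is a hypothetical name.

```python
def parse_verbatim(text: str) -> str:
    """Return the verbatim content of inline text that begins with backticks."""
    # Length of the opening backtick run.
    n = len(text) - len(text.lstrip("`"))
    if n == 0:
        raise ValueError("text does not start with a backtick")
    body = text[n:]

    # Default: no closer found, so the content extends to the end.
    content = body
    i = 0
    while i < len(body):
        if body[i] != "`":
            i += 1
            continue
        # Measure this run of backticks.
        j = i
        while j < len(body) and body[j] == "`":
            j += 1
        if j - i == n:
            # Equal-length run: this closes the verbatim span.
            content = body[:i]
            break
        i = j

    # If the content starts or ends with a backtick, drop the single space
    # separating it from the opening or closing run.
    if content.startswith(" `"):
        content = content[1:]
    if content.endswith("` "):
        content = content[:-1]
    return content
```

Because the no-closer case falls through to "content extends to the end", a line of three backticks followed by text always has a valid reading as verbatim text under this rule.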

After writing that last comment, I think I'm slowly pushing myself over towards using tilde instead of backticks.

So, a code block would look like:

~~~
I am not a paragraph and I have no language specifier
~~~

or

~~~ I am not a paragraph and I have a language specifier
~~~

(Everything after ~~~ but before a new line ends up as the language specifier. "I am not a paragraph and I have a language specifier" in this case)

Which has a number of advantages:

  • We'd get rid of the ambiguity between paragraphs and code blocks.
  • We can't accidentally make a paragraph out of a code block.
  • We don't add new punctuation syntax.
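
A minimal sketch of the tilde-only rule described above, assuming that everything after the opening ~~~ on the same line is the language specifier and that the block closes on a line of tildes at least as long as the opening fence (the closing rule is my assumption, borrowed from how backtick fences behave). `TILDE_FENCE` and `parse_tilde_block` are hypothetical names.

```python
import re
from typing import List, Optional, Tuple

# Assumption: 3+ tildes, then the rest of the line (trimmed) is the language.
TILDE_FENCE = re.compile(r"^(~{3,})[ \t]*(.*?)[ \t]*$")


def parse_tilde_block(lines: List[str]) -> Optional[Tuple[Optional[str], List[str]]]:
    """Return (language, code_lines) if lines[0] opens a tilde fence, else None."""
    if not lines:
        return None
    m = TILDE_FENCE.match(lines[0])
    if m is None:
        return None
    fence, language = m.group(1), m.group(2) or None

    code: List[str] = []
    for line in lines[1:]:
        stripped = line.strip()
        # A closing fence is only tildes and at least as long as the opener.
        if stripped and set(stripped) == {"~"} and len(stripped) >= len(fence):
            break
        code.append(line)
    return language, code
```

Here the first character of the line already decides that this is a code block, and a hard break inside the specifier line can only change the language, never the block type.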