Proposal: Punctuation Character for Code Block's language specifier
RonaldZielaznicki opened this issue · 6 comments
Proposal
Utilize a punctuation character for fenced code block language specifier.
I use ? below, but would be happy with any character.
Why?
The current syntax introduces ambiguity and breaks djot's design goals, namely goals 7 and 3 (in spirit).
In addition to resolving these breaks from the goals, it'd allow easier parsing. As is, it's not until we get to the second word of a line that a parser would know the line is the start of a paragraph rather than a code fence.
Goals 3 and 7 - ambiguity and friendliness to hard wrapping
Paragraph
Here we have a paragraph that I'd like to introduce a hard break to. Live Demo
``` I'm a paragraph
```
<p><code>I'm a paragraph</code></p>
Code Block
Add a hard break at the wrong spot, and we no longer have a paragraph. Live Demo
``` I'm
a paragraph
```
<pre><code class="language-I'm">a paragraph</code></pre>
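The two examples above differ only in where the line break falls. A minimal sketch of why, assuming a simplified opening-line rule (three or more backticks, an optional single-word specifier, nothing else); the names `OPENING` and `classify_first_line` are hypothetical and this is not the djot.js implementation:

```python
import re

# Built programmatically so the example never contains a literal fence.
FENCE = "`" * 3

# Assumed rule: 3+ backticks, an optional one-word specifier, then end of line.
OPENING = re.compile(r"^(`{3,})[ \t]*(\S+)?[ \t]*$")


def classify_first_line(line: str):
    """Return ("code_block", specifier_or_None) or ("paragraph", None)."""
    m = OPENING.match(line)
    if m is None:
        # A second word on the line rules out a code fence.
        return ("paragraph", None)
    return ("code_block", m.group(2))


if __name__ == "__main__":
    print(classify_first_line(FENCE + " I'm a paragraph"))
    print(classify_first_line(FENCE + " I'm"))
```

Under this reading, the parser cannot decide until it has seen whether a second word follows, and hard wrapping after the first word flips the result.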
What it'd look like
With a specifier character, we know from the start of the line whether this is a fenced code block or a paragraph.
With Specifier Character
```? I'm not a paragraph
```
<pre><code>I'm not a paragraph</code></pre>
Without Specifier Character
``` I'm
a paragraph```
<p><code>I'm a paragraph</code></p>
With Language
```?djot I'm not a paragraph
```
<pre><code class="language-djot">I'm not a paragraph</code></pre>
Optional Spacing
``` ?djot I'm not a paragraph
```
<pre><code class="language-djot">I'm not a paragraph</code></pre>
Raw Fencing
```
I'm not a paragraph
```
<pre><code>I'm not a paragraph</code></pre>
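The cases above can be sketched as a single opening-line check. This is only one possible reading of the proposal: it assumes the language follows ? with no space in between, that any remaining text on the line is code content (as the rendered outputs suggest), and that a bare fence still opens a code block. `OPENING` and `parse_opening` are hypothetical names:

```python
import re

# Built programmatically so the example never contains a literal fence.
FENCE = "`" * 3

# Assumed grammar: 3+ backticks, optional spaces, then either end of line
# or "?" + optional language + the rest of the line as code content.
OPENING = re.compile(
    r"^(?P<fence>`{3,})[ \t]*(?:\?(?P<lang>\S*)[ \t]*(?P<rest>.*))?$"
)


def parse_opening(line: str):
    """Return (kind, language, first_content) for an opening line."""
    m = OPENING.match(line)
    if m is None:
        # Text after the backticks without "?" is an inline verbatim span.
        return ("paragraph", None, "")
    lang = m.group("lang") or None
    rest = m.group("rest") or ""
    return ("code_block", lang, rest)


if __name__ == "__main__":
    for line in (
        FENCE + "? I'm not a paragraph",
        FENCE + "?djot I'm not a paragraph",
        FENCE + " ?djot I'm not a paragraph",
        FENCE,
        FENCE + " I'm",
    ):
        print(parse_opening(line))
```

The decision is made at the first character after the backticks, so no lookahead to a second word is needed.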
Additional Benefits
Multiple words are allowed in a code fence, as suggested in #214.
Additional Thoughts
The character doesn't need to be ?. It's just a character I thought would be easy to utilize in this context. = would have been my first choice, but that's used for raw HTML blocks.
Alternatives
Use tilde instead of backtick
Having punctuation characters pull double duty is what creates the ambiguity. Backtick is used for verbatim and code blocks. Tilde is used for subscript. But, unlike in verbatim, a series of ~ doesn't do anything special for subscripts. I'd lean heavier into this alternative myself, but backticks for code fences are fairly well understood by users, and they'd be the ones who'd have to adapt.
Block Attributes
I'd love to use a block attribute to set the language specifier, but that resolves neither the ambiguities nor the unfriendliness towards hard breaks.
Related or Similar Issues
Actually, the current djot.js parser does allow ~~~ for code blocks, even though this isn't mentioned in the syntax description. So requiring that is a tempting solution to the ambiguity problem.
On the other hand, using ``` has a pleasing conceptual simplicity; it's not too different from """ for multiline strings and " for inline strings in some languages.
Curious to hear other comments on this.
Actually, the current djot.js parser does allow ~~~ for code blocks, even though this isn't mentioned in the syntax description. So requiring that is a tempting solution to the ambiguity problem.
Yup! It's why I pushed it as an alternative. Glad we're aligned there.
On the other hand, using ``` has a pleasing conceptual simplicity; it's not too different from """ for multiline strings and " for inline strings in some languages.
It is pretty intuitive at this point to reach for ```, isn't it? I didn't even know about ~~~ as a fence until I glanced at the JS implementation to see how it handled the language specifier. Then I tried it with GitHub's markdown preview.
Curious to hear other comments on this.
Same. The other issues linked above are already filled with plenty of insights, but I don't think any of them touched on this specific issue/proposal.
Ah, one more comment on:
Actually, the current djot.js parser does allow ~~~ for code blocks, even though this isn't mentioned in the syntax description. So requiring that is a tempting solution to the ambiguity problem.
Requiring ~~~ as the code block fence doesn't completely solve the ambiguity issue unless a blank line followed by ~~~ always becomes a code block. But even then, a specifier character would help, because hard breaks still make it ambiguous whether the first word is a language specifier or not.
Without Hard Break
~~~ I'm a paragraph
~~~
leads to
<p>~~~ I’m a paragraph~~~</p>
With Hard Break
While
~~~ I'm
a paragraph
~~~
becomes
<pre><code class="language-I'm">a paragraph
</code></pre>
The rule being:
A code block starts with a line of three or more consecutive backticks, optionally followed by a language specifier, but nothing else.
Then
``` Some things
```
Should perhaps
- Be a syntax error
- Or a language specifier "Some things"
(Depending on imposing or not restrictions on language specifiers)
@Omikhleia As is, calling it a syntax error might be difficult. Verbatim text is defined as:
Verbatim content begins with a string of consecutive backtick characters (`) and ends with an equal-lengthed string of consecutive backtick characters.
Material between the backticks is treated as verbatim text (backslash escapes don’t work there).
If the content starts or ends with a backtick character, a single space is removed between the opening or closing backticks and the content.
If the text to be parsed as inline ends before a closing backtick string is encountered, the verbatim text extends to the end.
This is verbatim text:
`Verbatim text`
but so is this:
```Verbatim text```
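To make the overlap with code fences concrete, here is a rough sketch of the quoted verbatim rules, assuming the input starts at the opening backtick run. `parse_verbatim` is a hypothetical helper, not the djot.js parser:

```python
import re


def parse_verbatim(text: str) -> str:
    """Return the verbatim content of an inline span starting at text[0]."""
    opener = re.match(r"`+", text)
    if opener is None:
        raise ValueError("text must start with a backtick")
    width = len(opener.group())
    body_start = opener.end()
    # If no closing run is found, the span extends to the end of the text.
    body_end = len(text)
    for run in re.finditer(r"`+", text[body_start:]):
        # Only a run of exactly the same length closes the span.
        if len(run.group()) == width:
            body_end = body_start + run.start()
            break
    content = text[body_start:body_end]
    # Drop one padding space when the content itself starts or ends
    # with a backtick.
    if content.startswith(" `"):
        content = content[1:]
    if content.endswith("` "):
        content = content[:-1]
    return content


if __name__ == "__main__":
    triple = "`" * 3  # avoids writing a literal fence in this example
    print(parse_verbatim("`Verbatim text`"))
    print(parse_verbatim(triple + "Verbatim text" + triple))
```

Because any run length is a valid opener, a run of three backticks followed by text is already a well-formed inline span, which is why treating it as a syntax error is awkward.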
After writing that last comment, I think I'm slowly pushing myself over towards using tilde instead of backticks.
So, a code block would look like:
~~~
I am not a paragraph and I have no language specifier
~~~
or
~~~ I am not a paragraph and I have a language specifier
~~~
(Everything after ~~~ but before a new line ends up as the language specifier: "I am not a paragraph and I have a language specifier" in this case.)
Which has a number of advantages:
- We'd get rid of the ambiguity between paragraphs and code blocks.
- We can't accidentally turn a paragraph into a code block.
- We don't add new punctuation syntax.
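A small sketch of what that opening-line rule could look like, assuming three or more tildes and that surrounding whitespace is trimmed from the specifier; `TILDE_OPENING` and `parse_tilde_opening` are hypothetical names used only for illustration:

```python
import re

# Assumed rule: 3+ tildes, then everything up to the end of the line
# (trimmed of surrounding spaces) is the language specifier.
TILDE_OPENING = re.compile(r"^(~{3,})[ \t]*(.*?)[ \t]*$")


def parse_tilde_opening(line: str):
    """Return (fence_length, specifier_or_None), or None if not a fence."""
    m = TILDE_OPENING.match(line)
    if m is None:
        return None
    return (len(m.group(1)), m.group(2) or None)


if __name__ == "__main__":
    print(parse_tilde_opening("~~~"))
    print(parse_tilde_opening(
        "~~~ I am not a paragraph and I have a language specifier"
    ))
    print(parse_tilde_opening("Just a paragraph"))
```

With this rule the number of words after the fence no longer matters, so hard wrapping cannot change a paragraph into a code block or the reverse.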