commonmark/commonmark-spec

the definition of "line" should precede its usage in the spec?

greggman opened this issue · 9 comments

I'm new to this spec and am probably missing things, but reading from the top down I first see section 2.1:

A line is a sequence of zero or more characters other than line feed (U+000A) or carriage return (U+000D), followed by a line ending or by the end of file.

Then I get to 4.1

4.1 Thematic breaks

A line consisting of optionally up to three spaces of indentation, followed by a sequence of three or more matching -, _, or * characters, each followed optionally by any number of spaces or tabs, forms a thematic break.

...

Four spaces of indentation is too many...

So then I go test this markdown

* foo
   * foo
 
     ***

That *** has more than 3 spaces of indentation, so it fails the rule just quoted. My guess is that the actual definition of line in 4.1 is supposed to include trimming the current block's indentation level? Maybe that gets covered elsewhere, but it seemed worth mentioning that 4.1 seems to rely on behavior that is unspecified up to that point.
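To make the mismatch concrete, here is a minimal sketch of the 4.1 rule read literally and applied to a raw line of the file. The function name `isThematicBreak` and the regular expression are illustrative assumptions, not text from the spec, and the sketch ignores tabs in the indentation.

```javascript
// Hypothetical helper: a literal reading of the 4.1 rule for one raw line.
// Up to three spaces, then three or more matching -, _, or * characters,
// each optionally followed by spaces or tabs.
function isThematicBreak(line) {
  return /^ {0,3}([-_*])[ \t]*(?:\1[ \t]*){2,}$/.test(line);
}

console.log(isThematicBreak("***"));      // true
console.log(isThematicBreak("   - - -")); // true
console.log(isThematicBreak("     ***")); // false: five spaces of indentation
```

Read this way, the third line of the list example cannot be a thematic break unless something removes the list item's indentation first.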

You're putting things in lists. Lists are covered later. I'd think that it makes sense that when you use constructs explained later, those constructs can interfere with earlier explanations?

Most specs I've read would not bury exceptions to their definitions deep in the spec.

Also, having just read through the list section, I'm not seeing where the definition of line in 2.1 and 4.1 is updated.

Note the same issue happens with block quotes

> foo
>
> ***

That > *** does not fit the definition from 4.1 and 2.1.

jgm commented

There's a basic sufficient condition for a thematic break, which you quoted. And then the block quote spec says: if you have a sequence of lines that constitute an X, and you prepend > to each line, then you have a block quote whose contents are an X.

Together these rules imply that

> ***

is a thematic break inside a block quote. Hope that helps.

Just FYI, I was not looking for help. I was suggesting the spec would be better if the definition at the top of the spec wasn't modified later. You apparently disagree, but as an opinion of one (me): the spec starts off with the definition of a line in 2.1 and the definition of a thematic break in 4.1, both defined in a specific and IMO misleading way, and then two or more sections far below effectively redefine what a line is. I found that a less than ideal way to specify something.

Ideally you don't have to keep the entire spec in your head to understand a portion of it, but as it is, proceeding from the top down will start the reader with an erroneous mental model, IMO. I'm suggesting that rewriting the top part of the spec to say that a line might have a prefix that gets removed would give the reader the correct mental model from the start.

jgm commented

It's just not true that thematic break gets "redefined" later. Think of the spec as an inductive definition with base cases for the leaf elements and an inductive clause for the containers.
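As a rough illustration of the inductive reading, the sketch below treats the thematic break rule as a base case and the block quote marker as an inductive clause that wraps whatever the rest of the line is. The names `isThematicBreak` and `describeLine` are assumptions for this example. It handles one line at a time and ignores lists, lazy continuation, and every other block type, so it is not how a real CommonMark parser works.

```javascript
// Base case (hypothetical helper): the 4.1 thematic break rule, read literally.
function isThematicBreak(line) {
  return /^ {0,3}([-_*])[ \t]*(?:\1[ \t]*){2,}$/.test(line);
}

// Inductive clause: a line that starts with a block quote marker is a block
// quote whose content is described by applying the same rules to the rest.
function describeLine(line) {
  const quoted = /^ {0,3}> ?(.*)$/.exec(line);
  if (quoted) {
    return "block quote > " + describeLine(quoted[1]);
  }
  if (isThematicBreak(line)) return "thematic break";
  return line.trim() === "" ? "blank" : "paragraph text";
}

console.log(describeLine("***"));     // thematic break
console.log(describeLine("> ***"));   // block quote > thematic break
console.log(describeLine("> > ***")); // block quote > block quote > thematic break
```

The base rule is never changed here. The container clause only decides which part of the raw line the base rule gets to see.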

There's a common definition of "line" pretty much every programmer assumes. It's the result you get when you do

lines = someText.split('\n')

or the equivalent in some other language (plus whatever is needed for DOS line endings).

The definition of line in 2.1 does nothing to tell the reader that this common definition does not apply.
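A small sketch of why DOS and old Mac line endings need the extra handling mentioned here. The sample string is made up for illustration; the behavior of `String.prototype.split` is standard JavaScript.

```javascript
// Made-up sample text that mixes CRLF, lone CR, and LF line endings.
const someText = "foo\r\nbar\rbaz\n";

// Splitting on "\n" alone leaves "\r" attached and misses the lone "\r".
const naive = someText.split("\n");
console.log(JSON.stringify(naive)); // ["foo\r","bar\rbaz",""]

// Splitting on all three line endings gives the lines most people expect.
const lines = someText.split(/\r\n|\n|\r/);
console.log(JSON.stringify(lines)); // ["foo","bar","baz",""]
```

The trailing empty string is a by-product of `split` when the text ends with a line ending. It is not an extra line in the sense of section 2.1.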

Anyway, I'm not here to argue. I found the spec unclear and misleading. That's my opinion. You can take it or leave it. Just remember, you are intimately familiar with the spec, so in your mind you already know everything there is to know about it. The new reader does not have that. They come to it with the knowledge of what "line" means since they've worked with lines all their careers, hobbies, classes. They get to 2.1 and they'll most likely think "yep, a line is exactly what it's always been every time I've split a string into lines or read a file one line at a time".

It's not until deep into the spec that it's pointed out that mental model is wrong.

I agree that the term line is currently used with two or maybe even three meanings in the specification. The second and the third are hardly distinguished:

  1. A code line beginning at start of file or immediately after a line ending as specified in 2.1. Note that the prose there currently does not say where a line starts, only where it ends and what it may contain.
  2. A content line which is a code line stripped of any block markers and inherited indentation from the preceding code line, but including its own line prefix and added indentation before and after it as well as the optional line suffix in ATX headings.
  3. A text line is a trimmed content line with all prefixes and suffixes and trailing whitespace removed. Maybe this includes collapsing soft line breaks and joining consecutive content lines of the same block.
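One possible reading of these three meanings, sketched for a single ATX heading inside a block quote. The function names `contentLine` and `textLine` and both regular expressions are assumptions made for this example. The sketch only knows about one block quote marker and ATX heading syntax, and `textLine` is only meaningful when the content line really is a heading.

```javascript
// Hypothetical helper: strip one block quote marker from a code line.
function contentLine(codeLine) {
  return codeLine.replace(/^ {0,3}> ?/, "");
}

// Hypothetical helper: strip the ATX heading prefix, the optional closing
// sequence, and surrounding whitespace from a content line.
function textLine(content) {
  return content
    .replace(/^ {0,3}#{1,6}[ \t]+/, "") // opening sequence such as "## "
    .replace(/[ \t]+#+[ \t]*$/, "")     // optional closing sequence such as " ##"
    .trim();
}

const codeLine = "> ## Heading ##";    // meaning 1: the raw line of the file
const content = contentLine(codeLine); // meaning 2: "## Heading ##"
const text = textLine(content);        // meaning 3: "Heading"
console.log(JSON.stringify([codeLine, content, text]));
```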

Section 2.1 is one of the first parts of the spec. It’s a preliminary that names the characters and groups of characters used later in the spec. What is a character? What’s whitespace or punctuation?

One such character group is zero or more characters that are not line endings: a line.
As @Crissov notes, there’s nothing in the phrasing that indicates a line must start after a line ending in the whole document or at the beginning of the whole document.
The idea that the term line is equivalent to line = wholeDocument.split('\n')[0] does not match what’s written in the spec.

There's a common definition of "line" pretty much every programmer assumes.

I don’t really subscribe to the idea that “everyone” thinks that that is what lines are. And even if they do, section 2.1 explains that the term is used in the spec for a sequence of characters. If you missed 2.1, we could link to it from 4.1.

Anyway, I'm not here to argue.

To me, your comments come across as the inverse of this. You’re using hand-wavy terms of “every programmer assumes”, “all their careers, hobbies, classes”. Things like “Just remember”, “You can take it or leave it”, all paint a picture of “you and everyone are right”, while the tiny group of people who work on this are completely wrong.

I also haven’t heard any practical improvement you suggest?

To me, your comments come across as the inverse of this. You’re using hand-wavy terms of “every programmer assumes”, “all their careers, hobbies, classes”. Things like “Just remember”, “You can take it or leave it”, all paint a picture of “you and everyone are right”, while the tiny group of people who work on this are completely wrong.

That's a very uncharitable view. I said it's an opinion. My opinion is correct for me. If I say this liquorice candy tastes bad to me and you say it tastes great, I'm not telling you you are wrong. But I am not wrong either, as it applies to me. If I say something is hard for me to understand, that's true for me, whether or not it's true for you. So the "take it or leave it" comment means: this is my point of view, it confused me, and my experience suggests it will confuse others, but if you disagree and feel it's unlikely to confuse others, feel free to ignore it.

To me the spec is confusing because I've parsed 1000s of files before and I've written code like

const lines = content.split(/\r\n|\n|\r/)
for (const line of lines) {
   ...
}

2.1 doesn't tell me that the variable line in that code snippet is wrong, so I could easily write that code and only later find out: oh, that's not what they mean by "line". The line in the code above is really some kind of meta line:

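const metaLines = content.split(/\r\n|\n|\r/)
for (const metaLine of metaLines) {
   // extractLineFromMetaLine is a made-up name for whatever strips the block prefixes
   const line = extractLineFromMetaLine(metaLine)
   ...
}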
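For the nested list example from the top of the thread, a very rough sketch of what such an extraction step could look like. Everything here is an assumption for illustration: `bulletContentOffset` is a made-up helper, `extractLineFromMetaLine` takes an extra `indent` argument that the snippet above does not have, and a real parser would track all open containers instead of looking at a single earlier line.

```javascript
// Hypothetical helper: the column at which a bullet list item's content
// starts, or -1 if the line does not open a bullet item. Simplified: spaces
// only, bullet markers only, no check for thematic breaks such as "- - -".
function bulletContentOffset(line) {
  const m = /^( *)[-+*]( +)\S/.exec(line);
  return m ? m[1].length + 1 + m[2].length : -1;
}

// Hypothetical stand-in for extractLineFromMetaLine: drop `indent` leading
// spaces if they are all present, otherwise return the raw line unchanged.
function extractLineFromMetaLine(metaLine, indent) {
  return metaLine.startsWith(" ".repeat(indent)) ? metaLine.slice(indent) : metaLine;
}

const metaLines = "* foo\n   * foo\n\n     ***".split(/\r\n|\n|\r/);
const indent = bulletContentOffset(metaLines[1]); // content column of the nested item
const line = extractLineFromMetaLine(metaLines[3], indent);
console.log(indent, JSON.stringify(line)); // 5 "***"
```

Once the five columns that belong to the nested list item are removed, what remains is a plain *** with no indentation, which is the input the 4.1 rule is written for.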

Concretely, in my opinion it would be better to make that clear in 2.1.

A line is a sequence of zero or more characters other than line feed (U+000A) or carriage return (U+000D), followed by a line ending or by the end of file. A line does not necessarily start after a line ending, as it may be preceded by various block constructions.

Or something to that effect.

Now I immediately know this definition is different from the concept of line in just about every example of "lines of a string" ever written. That seems like it would confuse fewer people, even if fewer = 1, me.