remarkjs/remark

[remark-parse] Ordered lists are not recognized if they both use leading zeroes and interrupt a block

Closed this issue · 11 comments

Affected packages and versions

remark-parse@11.0.0

Link to runnable example

No response

Steps to reproduce

  1. In a new folder, create a new Node module by running e.g. pnpm init.

  2. Run pnpm install remark-parse@11.0.0.

  3. Run pnpm install unified@11.0.3.

  4. Save the code below as repro.mjs file and run node repro.mjs. (I used Node v18.17.0.)

    This will generate a JSON file containing the parsed AST (sans position properties, so that they can be easily diffed) for each of the Markdown snippets it contains.

    repro.mjs
    import { writeFile } from "node:fs/promises";
    import remarkParse from "remark-parse";
    import { unified } from "unified";
    
    const parser = unified().use(remarkParse).freeze();
    
    const documents = {
      noLeadingZeroesFollowing: `The preceeding paragraph.
    
    1. one
    4. two
    `,
    
      leadingZeroesFollowing: `The preceeding paragraph.
    
    01. one
    02. two
    `,
    
      noLeadingZeroesInterrupting: `The preceeding paragraph.
    1. one
    2. two
    `,
    
      leadingZeroesInterrupting: `The preceeding paragraph.
    01. one
    02. two
    `,
    };
    
    function stripPositions(node) {
      const { position, children, ...rest } = node;
    
      return { ...rest, children: children?.map(stripPositions) };
    }
    
    await Promise.all(
      Object.entries(documents).map(([name, text]) =>
        writeFile(
          name + ".json",
          JSON.stringify(stripPositions(parser.parse(text)), undefined, 2),
        ),
      ),
    );
  5. Observe that the files noLeadingZeroesFollowing.json, leadingZeroesFollowing.json, and noLeadingZeroesInterrupting.json are identical and that their root nodes contain both a paragraph node and a list node. However, the root node in leadingZeroesInterrupting.json instead contains only a single paragraph node. Diffing it against any of the other files will produce output similar to the following.

    repro.diff
    --- noLeadingZeroesFollowing.json	2023-10-13 16:11:26.261286672 -0700
    +++ leadingZeroesInterrupting.json	2023-10-13 16:11:26.261286672 -0700
    @@ -6,47 +6,7 @@
          "children": [
            {
              "type": "text",
    -          "value": "The preceeding paragraph."
    -        }
    -      ]
    -    },
    -    {
    -      "type": "list",
    -      "ordered": true,
    -      "start": 1,
    -      "spread": false,
    -      "children": [
    -        {
    -          "type": "listItem",
    -          "spread": false,
    -          "checked": null,
    -          "children": [
    -            {
    -              "type": "paragraph",
    -              "children": [
    -                {
    -                  "type": "text",
    -                  "value": "one"
    -                }
    -              ]
    -            }
    -          ]
    -        },
    -        {
    -          "type": "listItem",
    -          "spread": false,
    -          "checked": null,
    -          "children": [
    -            {
    -              "type": "paragraph",
    -              "children": [
    -                {
    -                  "type": "text",
    -                  "value": "two"
    -                }
    -              ]
    -            }
    -          ]
    +          "value": "The preceeding paragraph.\n01. one\n02. two"
            }
          ]
        }
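For a quicker check than a full diff, the top-level node types of each generated tree can be compared. The following is only a minimal sketch: `topLevelTypes` is a hypothetical helper (not part of remark or unified), and the two sample trees are hand-written to mirror the shapes described in step 5 rather than produced by the parser.

```javascript
// Hypothetical helper: list the types of the nodes directly under the root.
function topLevelTypes(tree) {
  return (tree.children ?? []).map((node) => node.type);
}

// Hand-written tree mirroring noLeadingZeroesFollowing.json (list items omitted).
const following = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [{ type: "text", value: "The preceeding paragraph." }],
    },
    { type: "list", ordered: true, start: 1, spread: false, children: [] },
  ],
};

// Hand-written tree mirroring leadingZeroesInterrupting.json.
const interrupting = {
  type: "root",
  children: [
    {
      type: "paragraph",
      children: [
        { type: "text", value: "The preceeding paragraph.\n01. one\n02. two" },
      ],
    },
  ],
};

console.log(topLevelTypes(following));
console.log(topLevelTypes(interrupting));
```

With real output, the same helper could be applied to each file after reading it and passing the contents through `JSON.parse`.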

Expected behavior

Ordered lists should be parsed consistently, regardless of whether their list markers have leading zeroes or the list interrupts a block.

Actual behavior

Ordered lists are recognized as such if their list markers have leading zeroes or they interrupt a block. However, ordered lists are not recognized as such if their list markers have leading zeroes and they interrupt a block.

Runtime

Other (please specify in steps to reproduce)

Package manager

pnpm

OS

Linux

Build and bundle tools

Other (please specify in steps to reproduce)

Apologies for not providing a runnable example, but I spent more time trying (and failing) to get codesandbox to do something useful than I did on the rest of the report. 😅

Thanks @benblank!
Here is the repro in a sandbox: https://stackblitz.com/edit/node-mneiet?file=index.js
I'm seeing the same behavior you describe when running remark 15.0.1.

Checking the four examples in the CommonMark Dingus:

  1. https://spec.commonmark.org/dingus/?text=The%20preceeding%20paragraph.%0A%0A1.%20one%0A4.%20two
  2. https://spec.commonmark.org/dingus/?text=The%20preceeding%20paragraph.%0A%0A01.%20one%0A02.%20two
  3. https://spec.commonmark.org/dingus/?text=The%20preceeding%20paragraph.%0A1.%20one%0A2.%20two
  4. https://spec.commonmark.org/dingus/?text=The%20preceeding%20paragraph.%0A01.%20one%0A02.%20two

It does indeed appear that all four should produce a list.
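The four links above follow one pattern: the Markdown snippet is percent-encoded and passed as the `text` query parameter. As a small sketch of how such links could be generated (`dingusUrl` is a hypothetical helper name, and the URL shape is taken from the links above):

```javascript
// Hypothetical helper: build a CommonMark Dingus link for a Markdown snippet.
function dingusUrl(markdown) {
  return (
    "https://spec.commonmark.org/dingus/?text=" + encodeURIComponent(markdown)
  );
}

// The four snippets from the report, without trailing newlines.
const snippets = [
  "The preceeding paragraph.\n\n1. one\n4. two",
  "The preceeding paragraph.\n\n01. one\n02. two",
  "The preceeding paragraph.\n1. one\n2. two",
  "The preceeding paragraph.\n01. one\n02. two",
];

for (const snippet of snippets) {
  console.log(dingusUrl(snippet));
}
```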


Tracing further: I suspect the issue is one level down, in micromark. I'm able to replicate it without generating the AST: https://stackblitz.com/edit/node-1ygk3h?file=index.js

Hi! This was marked as ready to be worked on! Note that while this is ready to be worked on, nothing is said about priority: it may take a while for this to be solved.

Is this something you can and want to work on?

Team: please use the area/* (to describe the scope of the change), platform/* (if this is related to a specific one), and semver/* and type/* labels to annotate this. If this is first-timers friendly, add good first issue and if this could use help, add help wanted.

> I suspect the issue is down one level in micromark, I'm able to replicate the issue without having the AST generated

Ah! Dang. I'd traced it this far down from Prettier and thought I'd gotten to the bottom of it. 🙂

Thanks for all the helpful links!

wooorm commented

I do think the spec is unclear for this:

> In order to solve of unwanted lists in paragraphs with hard-wrapped numerals, we allow only lists starting with `1` to interrupt paragraphs. Thus,~

(right above example 304).
As in, I followed those words here.

I think that the current behavior is in line with the reasoning there. Natural language phrases might include 1., but 2. or 01. are less likely.
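To make the two possible readings of "starting with `1`" concrete, here is a minimal sketch. It is not micromark's code, nor that of any other parser; `canInterruptLiteral` and `canInterruptNumeric` are hypothetical names, and the marker pattern is simplified (no leading indentation, and the item must have content).

```javascript
// Simplified ordered list marker: 1 to 9 digits, "." or ")", whitespace, content.
const orderedMarker = /^(\d{1,9})[.)][ \t]+\S/;

// Reading A (assumption): the marker must be written literally as "1".
function canInterruptLiteral(line) {
  const match = orderedMarker.exec(line);
  return match !== null && match[1] === "1";
}

// Reading B (assumption): the marker's numeric value must be 1, so "01" counts.
function canInterruptNumeric(line) {
  const match = orderedMarker.exec(line);
  return match !== null && Number(match[1]) === 1;
}

for (const line of ["1. one", "01. one", "2. two"]) {
  console.log(line, canInterruptLiteral(line), canInterruptNumeric(line));
}
```

The two functions agree on `1. one` and `2. two` and differ only on markers with leading zeroes such as `01. one`, which is the case this issue is about.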

wooorm commented

If you care strongly about this, could you perhaps open an issue with commonmark/commonmark-spec to check what the idea is?

Actually, I missed that when I was reading through the spec. I'm not sure I 100% agree with the reasoning behind it, but those reasons do at least appear to be pretty clear.

I may indeed open up an issue with regard to the phrasing, though; I feel the section you quoted would be improved by calling out that it's only referring to ordered lists and to the markers 1. and 1) (not the character 1), even if there are examples demonstrating both cases. The emphasis on the principle of uniformity also suggests that the exception applies to nested lists as well, but I don't see text or an example calling that out.

I also have to admit to being a bit surprised to see "interrupting, not starting with 1" called out as not being valid, simply because when I was checking BabelMark, a large number of the parsers (including nine of the twelve marked as specifically targeting CommonMark) considered it valid.

On the one hand, it's a shame to "disagree" with so many other implementations, but the spec is clear as to what the Right Thing is, and it isn't what I was trying to do. I'll go ahead and close the issue.

Thanks for taking the time to look into this!

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.

Hi team! Could you describe why this has been marked as wontfix?

Thanks,
— bb

Hi team! I don’t know what’s up as there’s no phase label. Please add one so I know where it’s at.

Thanks,
— bb

wooorm commented

There’s a wide variety of parsers that all do things differently.
CommonMark likes to be ambiguous on all the edge cases. That is to be expected when the spec is mostly a test suite of input/output examples, and not an explanation of an algorithm (such as HTML).
I’d like a more formal spec. But I can see value in this too.
Anyway, feel free to open a PR against the spec that adds another example for the `01` case. Then I (and others) will go with whatever is decided there!