commonmark/commonmark-spec

Stable identifiers for examples

Closed this issue · 25 comments

When a new specification is published, examples get renumbered if a new example is added or if one is suppressed.

This is a bit annoying when referring to examples as the specification evolves, since linking to an example with, say, printf ("https://spec.commonmark.org/%s/#example-354", version) cannot be done reliably.

For example in the cmarkit implementation which includes a layout preserving CommonMark renderer I have a classification of those examples of the specification which do not round trip. It's a bit annoying that it is now desynchronized with the latest specification.

I'm not sure I have a good answer for how to provide that, except perhaps to automatically insert the current numbers in the current text and then make sure that new additions get an unused identifier (e.g. if an example is inserted between examples 1 and 2 you could simply identify it as 1a) and that the identifiers of deleted examples do not get recycled. That's for example how the Unicode Text Segmentation standard proceeds when it adds new rules.

jgm commented

Solution to link issue: always link to a specific version of the spec.

Solution to classification issue: use the sha1 hash of the example text as an identifier rather than the example number.

Solution to link issue: always link to a specific version of the spec.

Not a very user friendly solution. I have various links in the API docs on the standard, you don't really want to link to different versions of the standard in the docs, it's confusing.

Solution to classification issue: use the sha1 hash of the example text as an identifier rather than the example number.

Again I'd like to link back on the specification. Hashes are one way functions ;-)

I agree that having example numbers change between versions is a bit annoying, but I also have no good solution to this.

For links, you could try using text fragment links. #:~:text=[prefix-,]textStart[,textEnd][,-suffix]
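
A minimal sketch of building such a link, assuming only the textStart and textEnd parts are needed. The helper name text_fragment_url is hypothetical. Dashes, commas and ampersands have a special meaning inside a text directive, so they are percent-encoded here.

```python
from urllib.parse import quote


def _encode(text):
    # quote() never escapes "-", but "-" delimits prefix/suffix in a
    # text directive, so escape it by hand.
    return quote(text, safe="").replace("-", "%2D")


def text_fragment_url(base, text_start, text_end=None):
    """Return base with a '#:~:text=' directive appended (hypothetical helper)."""
    fragment = _encode(text_start)
    if text_end is not None:
        fragment += "," + _encode(text_end)
    return f"{base}#:~:text={fragment}"
```

Such a link only resolves while the quoted text still appears verbatim in the target page.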

For links, you could try using text fragment links. #:~:text=[prefix-,]textStart[,textEnd][,-suffix]

That looks quite brittle, and again linking to different versions of the specification is not user friendly. You lure users into reading outdated specification material. CommonMark is already confusing enough to understand :-)

Honestly now that the specification is mostly stable I don't think it would be such a chore to maintain alphanumerical identifiers manually along the scheme I suggested above. I'm not sure anyone really cares about this being a linear sequence of numbers.

That would entail finding a way to specify the id whenever you define an example (likely ```example id), making spec_tests.py read this, changing the type of the id field of spec.json accordingly, and using the new scheme whenever a new example is introduced.

(Bonus point: you can now cross-reference examples from the specification text itself if you need to)

You can also solve most of this by changing how you think about CommonMark: it’s a living standard that gets commits.

Same with HTML or Unicode or whatnot: you follow the latest version and link to the text.
And if you care about changes to specs, Git(Hub) links/refs are for that.

The numbers of versions here are meaningless anyway. Because the grammar of markdown (and HTML) is .* (there are no errors; any character does something), that means that any change to the spec is also technically a breaking change.

It's a bit annoying that it is now desynchronized with the latest specification.

Practically, you can improve this when crawling the spec for test cases, by ignoring the numbers, and looking at headings and then the relative number of each test case in them.

You could also use an MD5 or so hash of the markdown.


If we do anything here, I recommend using unique IDs that are not ordered but instead describe the test case. Or auto-generated MD5 hashes. And/or stop using versions (no problem with date snapshots)

And if you care about changes to specs, Git(Hub) links/refs are for that.

I don't care about changes to the spec, I care about being able to point people to the right information bits as the specification changes without losing them in outdated information.

And that:

You can also solve most of this by changing how you think about CommonMark: it’s a living standard that gets commits.

won't help at all I'm afraid.

The numbers of versions here are meaningless anyway. Because the grammar of markdown (and HTML) is .* (there are no errors; any character does something), that means that any change to the spec is also technically a breaking change.

That's a curious point of view. Versions are precisely meaningful since they entail semantic changes on how your markdown is going to be interpreted.

Versions are precisely meaningful since they entail semantic changes on how your markdown is going to be interpreted.

I think you describe Git hashes / snapshots of Git hashes more than a particular number.

Can you elaborate with what you mean by “semantic changes”. Do you mean the text that is in commits/PRs?

That's a curious point of view.

Perhaps you’d be interested in reading more about it: http://trevorjim.com/a-specification-for-markdown/. See also the “With HTML5” link there.
Relating to this current conversation and my point on versions, “HTML5” is referenced there, 12 years ago. Now what exists is HTML. A similar change happened to JavaScript, there’s “ES 2024” now.

And, https://xkcd.com/1172/.

I have various links in the API docs on the standard, you don't really want to link to different versions of the standard in the docs, it's confusing.
#763 (comment)

Even if we used IDs, tests change, they’re removed or split up. Characters are added/removed. Are those the same test? How much difference is a new test case?
Wouldn’t it be better to link to terms or headings rather than particular test cases?
Or inline the input/output markdown into your docs, so that you can omit unneeded info and focus on what you’re discussing?

Again I'd like to link back on the specification. Hashes are one way functions ;-)
#763 (comment)

I think @jgm correctly inferred that you’re asking about two problems. You are responding about the first here, where John answered the second.

You can hash each example in each snapshot of the spec. Then you know whether an exact example existed in a different snapshot?

From what I can see, you have only shown the case of the hard to maintain exception list?
Perhaps examples of your docs, where you currently have links to test cases, might help me understand your problems better.

Perhaps you’d be interested in reading more about it: http://trevorjim.com/a-specification-for-markdown/.

I know you very much like the trope that the BNF of markdown is .* but it's entirely out of topic here.

Are those the same test?

As long as you keep the spirit of the test (what it conceptually shows) yes.

Can you elaborate with what you mean by “semantic changes”.

Your HTML renderings are going to change from one version to another. A CommonMark implementation abiding by 0.30 won't generate the same HTML as one abiding by 0.31.2. Users are interested in knowing which version of the standard their implementation runs on, otherwise the whole point of CommonMark is actually moot :-)

Anyways, I don't want to spend too much energy trying to convince people about the value of stable identifiers across specification versions.

@jgm just tell me whether I will have to cope with this state of affairs or if you are willing to try to do something about it. Otherwise let's just close this and move on.

but it's entirely out of topic here.

This discussion touches on versioning and on how markdown works.

Users are interested in knowing which version of the standard their implementation runs on, otherwise the whole point of CommonMark is actually moot :-)

HTML being a living standard or JavaScript using yearly snapshots does not make them moot or not-semantic.

I have not received this question over the years.

Anyways, I don't want to spend too much energy trying to convince people about the value of stable identifiers across specification versions.

I understand open source as discussing problems and consensus seeking.

I don’t understand why jgm’s solution 2. for classification is not acceptable.
I also don’t understand why for 1. linking, using terms/headings, inlining the markdown/html you are discussing, or text fragment links, are not acceptable.
I still wonder whether you think the IDs I suggested (“I recommend using unique IDs that are not ordered but instead describe the test case.”) are acceptable.

jgm commented

@jgm just tell me whether I will have to cope with this state of affairs or if you are willing to try to do something about it. Otherwise let's just close this and move on.

I don't see a very good alternative, yet. I'm not interested in spending my time going through and creating unique ids for hundreds of examples.

jgm commented

By the way, you could make things work with hashes, with a bit of scripting. You just need a script that goes through the latest spec and associates hashes of examples with their numbers. Then you can map your hashes to example numbers.
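
A rough sketch of such a script, under the assumption that each example in spec.txt opens with a line of 32 backticks followed by " example" and closes with a line of 32 backticks. The function name example_hashes is made up for illustration, and the hash here covers the whole example body (Markdown, the "." separator and the HTML).

```python
import hashlib

FENCE = "`" * 32  # assumed example delimiter in spec.txt


def example_hashes(spec_text):
    """Map the SHA-1 of each example body to its 1-based example number."""
    hashes = {}
    number = 0
    body = None  # None outside an example, list of lines inside one
    for line in spec_text.splitlines():
        stripped = line.strip()
        if body is None:
            if stripped == FENCE + " example":
                body = []
        elif stripped == FENCE:
            number += 1
            text = "\n".join(body).encode("utf-8")
            hashes[hashlib.sha1(text).hexdigest()] = number
            body = None
        else:
            body.append(line)
    return hashes
```

Running it on two snapshots of the spec and looking up an old hash in the newer mapping gives the example's new number, or nothing if the example text changed.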

jgm commented

Well, actually, I suppose we could simply autogenerate identifiers by hashing the example's contents, instead of using identifiers based on numbers.
Then they would be stable.

I mentioned both too. What I worry about is that it’s essentially the same as those text fragment URLs, in that any one-character change to an example changes the hash. One more variation is to take the first letters of the first 3 words and the first letters of the last 3 words as a “hash”. It has to be a bit more involved, but something like it is more resistant to changes.

jgm commented

I think changing with any one character change is not a bad thing. Currently, one has no guarantees of link stability across versions. With this change, one would have a guarantee that links would point to the same examples, unless the examples have changed. That doesn't seem so bad. Presumably you don't want to link to an example whose content might have changed?

It’s unclear to me whether the OP deems it as acceptable.
A recent example that would break was all the http to https changes.
I think it’s fine but I don’t really see the original problem: I’d use "main" links and hash myself for the change resistance.

one more alternative that would also solve this: add a sort of “message” to each test case, that can be used by parsers for assertions: assert(fn(markdown), html, message).
This value would be pretty stable and could be turned into unique IDs.
It improves the case where you work on a parser and suddenly “613” or so fails.
It’s work initially, but then IMO worthwhile to maintain.

either are fine to me!

jgm commented

It’s work initially

Yes, unfortunately quite a bit of work, with hundreds of examples.

I'm not interested in spending my time going through and creating unique ids for hundreds of examples.

Let's be pragmatic. I think we agree the specification is not going to change much at that point. Here's a script that numbers the current examples by adding their current number in their info string:

cat << 'EOF' > number-examples.sh
#!/bin/bash

ex=0
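# Print a line unchanged, or with the running example number appended
# when the line opens an example block.
function process_line()
{
    # printf instead of echo, so lines such as "-n" are printed verbatim
    if [[ "$1" == *'``` example'* ]]; then
        ((ex++))
        printf '%s %s\n' "$1" "${ex}"
    else
        printf '%s\n' "$1"
    fi
}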

while IFS='' read -r line || [[ -n "${line}" ]]; do
    process_line "${line}"
done < "$1"
EOF
chmod +x number-examples.sh
./number-examples.sh spec.txt

Then it's a matter of tweaking these lines of spec_tests.py to extract the example_number from the info string rather than linearly count them.
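
As an illustration only (this is not the actual spec_tests.py code), reading the identifier from the info string could look roughly like this. The fence format, the name example_ids and the fallback to a running counter for unnumbered examples are all assumptions.

```python
import re

# Assumed opening fence: 32 backticks, " example", then an optional id.
OPEN_RE = re.compile(r"^`{32} example(?: (\S+))?$")


def example_ids(spec_text):
    """Return example ids in document order (hypothetical helper)."""
    ids = []
    count = 0
    for line in spec_text.splitlines():
        match = OPEN_RE.match(line.strip())
        if match is None:
            continue
        count += 1
        # Fall back to the positional number when no id is written.
        ids.append(match.group(1) or str(count))
    return ids
```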

And then just use the alphanumerical convention when changes are made to the spec. That is if someone wants to add examples between 45 and 46, these new example numbers should be manually numbered 45a, 45b, etc.
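
A small sketch of how a tool could order and validate such identifiers, assuming they always consist of a number followed by optional lowercase letters. The names sort_key and check_unique are hypothetical.

```python
import re

# Assumed id shape: digits, then optional lowercase letters ("45", "45a").
ID_RE = re.compile(r"^(\d+)([a-z]*)$")


def sort_key(example_id):
    """Order ids numerically first, then by letter suffix."""
    match = ID_RE.match(example_id)
    if match is None:
        raise ValueError(f"malformed example id: {example_id!r}")
    return (int(match.group(1)), match.group(2))


def check_unique(ids):
    """Raise ValueError on a malformed or repeated id."""
    seen = set()
    for example_id in ids:
        sort_key(example_id)  # validates the shape
        if example_id in seen:
            raise ValueError(f"duplicate example id: {example_id!r}")
        seen.add(example_id)
```

With this key, sorted(["46", "45b", "45", "45a"], key=sort_key) places 45, 45a and 45b before 46.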

If people are ok with this scheme, I'm willing to do the work so that this is established for future potential versions of the standard.

(I don't mind hashes, but when you write comments in code, talk to people, mention example numbers in regression tests etc., I much prefer to say example 45 or 45a of the specification than a gibberish of hex numbers. Also I don't mind if one example becomes a 404 or its content changes under the hood now and then, as long as the example shows the same thing. It's the current number mixup from one version to another that I find annoying to work with as an implementer of the standard).

jgm commented

I think we agree the specification is not going to change much at that point.

Not sure I agree.

Same.

I personally would not want to maintain a unique list of example IDs.

One more alternative that is quick and improves the use case: autogenerated IDs, still numbered, but including the current heading in it, so "atx-heading-6" and such.
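
A rough sketch of how such IDs could be generated, assuming examples are fenced with 32 backticks and sections start with "#" headings. The names slug and section_ids and the "preamble" default are made up for illustration. Heading-like lines inside examples are skipped so that they do not reset the counter.

```python
import re

FENCE = "`" * 32  # assumed example delimiter in spec.txt
HEADING_RE = re.compile(r"^#+ +(.*)$")


def slug(heading):
    """Lowercase a heading and join its words with dashes."""
    return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")


def section_ids(spec_text):
    """Return ids such as 'atx-headings-2', numbered per section."""
    ids = []
    section = "preamble"  # hypothetical name for text before any heading
    count = 0
    in_example = False
    for line in spec_text.splitlines():
        stripped = line.strip()
        if in_example:
            if stripped == FENCE:
                in_example = False
            continue
        if stripped == FENCE + " example":
            in_example = True
            count += 1
            ids.append(f"{section}-{count}")
            continue
        match = HEADING_RE.match(line)
        if match:
            section = slug(match.group(1))
            count = 0
    return ids
```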

jgm commented

@wooorm - are you proposing that the numbering in the generated IDs starts over for every section? That would make the example ids much more stable, in that changes would only change example ids in the same section. But it would have the drawback that the IDs no longer match the displayed example numbers. Unless the proposal is that we display "Example atx-heading-6" instead of "Example 123" as now...?

Right!

a) perhaps we don’t need numbers next to examples, HTML doesn’t either, and from a quick scroll through CM, those numbers (because they are unstable) aren’t used in the text.
b) The text “ATX headings 6” or “Example: ATX headings 6” are 👍 for me

I checked through long headings. They’re not very long so I think it’s fine.
If the length is a worry, I think the long ones can be improved:

  • Entity and numeric character references -> Character references (term that HTML used, “entities” don’t exist anymore, there are named and numeric character references)
  • Container blocks and leaf blocks -> Containers and leaves or Blocks. Or drop the section as it has barely any text and no examples anyway
  • Link reference definitions -> Definitions
  • Emphasis and strong emphasis -> Attention (word I use, as to parsers/markdown it’s more one thing, that later happens to turn into em/strong/i/b)

That would make the example ids much more stable, in that changes would only change example ids in the same section.

A section like 'List items' that I'm perusing now has almost 50 examples. At that point I prefer the current status quo which will be easier to work around. At least there is a single linear shift to consider rather than having to consider resets at each new heading. Let's not turn a simple problem into a more complicated one :-)

I'm not sure I understand the resistance behind the simple solution I propose. I don't think it introduces any kind of daunting overhead for people working on the spec (and has been shown to work relatively well in other standards).

You several times now have expressed the spec doesn’t change much and so it’s easy. This thing existed for 10 years. It’ll exist for 10 more. A lot happens in 10 years.

For me, it’s that you’re asking to introduce a semver-like versioning scheme, where numbers can’t be reused, with a lot of undecided factors. I see the following unknowns: What if there’s example 50 and 51 already, add 50a? Now what if 50 is removed? What if something is added between 49 and 50a? What if we reorder a section? What if we change an example a little? What if it’s completely rewritten? What if it’s dropped? Where do we document how it works? How to explain it to folks PRing an example?
Likely case is that someone here is maintaining this list you’re asking for in 10 years.

At that point I prefer the current status quo which will be easier to work around. At least there is a single linear shift to consider rather than having to consider resets at each new heading. Let's not turn a simple problem into a more complicated one :-)

Why? With this proposal you have relative numbers that at most shift 50.
Why reject this proposal? It’s still a linear shift but shorter. You took the longest region. Your code has the sections already: https://github.com/dbuenzli/cmarkit/blob/ccea66560ed9ccb6089979cbff82886a6abd47a4/test/trip_spec.ml#L10-L131.

and has been shown to work relatively well in other standards

Which ones?

You several times now have expressed the spec doesn’t change much and so it’s easy.

I will also add that the scheme is easy even if the spec changes. All the numbers are written in the spec it's just a matter of picking a nearby non-conflicting alphanumerical identifier (and the extracting program can easily check they are unique).

What if there’s example 50 and 51 already, add 50a? Now what if 50 is removed? What if something is added between 49 and 50a? What if we reorder a section? What if we change an example a little? What if it’s completely rewritten?

I have already mentioned these things earlier. I don't think the sequentiality matters much to perusers of the spec. You are not supposed to recycle numbers, but then I'm also not asking for a bulletproof formal solution. It's not a drama if one gets reused or if an example changes.

The current scheme is annoying because everything changes when there's little change.

Which ones?

It's mentioned in my first message.

Why reject this proposal?

Because the problem remains and is even worse to track automatically than the status quo.

In any case there are likely more important issues for the spec than this one. Let's not lose too much time on this. (I remain available to do the work if people want to move to something along the lines I proposed).