Knagis/CommonMark.NET

Add auto Github-style ID generation for headers

Closed this issue · 24 comments

I've read issue #44 and the discussion on CommonMark about it. In summary: they haven't decided what to do yet and there are numerous suggestions.

In the meantime, Github has been auto-generating IDs for their headers so that they can be linked to from a TOC for a long time now. This isn't in the CommonMark standard, so I propose that we implement this as a CommonMarkAdditionalFeature.

@Knagis Would it be OK as an additional feature called, say, GithubStyleHeaderIds? If I implemented this would the pull request be accepted?

The discussion you referenced mention many valid points why automatic ID generation should not be done.

I mostly support those arguments - instead I would like to see a way of specifying the correct anchor ID myself in the markdown document.

@Knagis That's why I said it could be added as an additional (non-standard) feature. Sure it may be better to specify a fixed ID but in some cases (and in my case at the moment), a developer may want a document to work the same in CommonMark as it does in Github. This feature would allow that. Also we don't yet have an agreement as to what the CommonMark spec will say on this, so there is nothing yet to implement except the Github way.

There are bigger problems than IDs to get full GitHub flavor markdown support like tables and checkboxes.

IDs can be written using the html # <a name="heading">Heading</a> syntax - this way the author has full control over what the generated name is and this actually works independent on where the document is then rendered - which is the main goal of CommonMark.

In this case I would really prefer a complete example (mostly given in #44, which is about 10 lines excluding the ID generation) in the wiki here that anyone could use but with clear references to the discussion that points out the many flaws with this approach. If we would get to the point where these IDs are the only thing remaining to get full GFM support in CommonMark.NET then it would be moved in the core library.

a name is not valid HTML5 and I'd rather not use it. Github uses IDs prefixed with user-content- so it has a rather specific implementation that needs to be emulated for compatibility. I'm not bothered about tables... and you have said that you are open to a tables implementation in another issue, anyway, so I don't see why you have a problem with this auto-ID stuff here as an additional feature!

@Knagis Come on just say yes. :-)

@jez9999 Have you looked at https://github.com/AMDL/CommonMark.NET/tree/pipe-tables? It's a revamped CommonMark.NET meant to allow creating extensions without changing the core implementation (much). Although I'm currently occupied with other projects, I hope to bring it to completion some time soon. I believe the current design is final; you're more than welcome to implement auto IDs as an extension.

@dmitry-shechtman I shouldn't have thought this would change the core implementation much which is why I'm surprised @Knagis is so resistant to it.

Can you describe what the flag would impact?

For example, how would this be processed:

# Heading

[link](#heading)

Is there a specification of how Github generates the heading ID?

For example:

# *emphasis _etc_*

# [links](/url) in heading

# šņāčÆÿőŒƕƢɱʥͲϢϞШѨ

I think the biggest argument against these auto IDs is that the author of the document can never know what ID will be assigned for these crazy cases...

Is there a specification of how Github generates the heading ID?

According to Pandoc's README, the relevant GFM-compatible extensions are auto_identifiers and ascii_identifiers.

The documentation for the former details the complete ID generation algorithm.

@Knagis The flag would impact all <h1> to <h6> block-level elements, inserting an <a... at the beginning with an ID generated from the heading text. The algorithm pretty simple:
https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline/toc_filter.rb

Basically remove characters that match [^\w\-\ ], and convert spaces to dashes. Also lowercase all the text. Keep a hash table of generated IDs and for each dupe, increment the count and add -x on the end where x is the 0-indexed count. Prefix user-content- to the ID.

Links to fragments would be rewritten to normalize them as above, and prefix user-content- to them. Github uses JavaScript to do that, but I don't think we should put JS in; just link to the actual fragment ID.

The IDs generated in your example cases are:

user-content-emphasis-etc
user-content-links-in-heading
user-content-šņāčÆÿőŒƕƢɱʥͲϢϞШѨ

It's pretty simple to figure out which IDs are generated, especially as (and I would implement this too) a link appears beside each heading which links to the fragment so the author just has to click on that to see where to link to

OK, I give in :)

Though which output format should be used:

  1. <hx id="...">foo</hx>
  2. <hx><a id="...">foo</a></hx>
  3. <hx><a id="..." class="..."></a>foo</hx>

Github uses 3) but that requires additional CSS styles to become useful.

HtmlFormatterSlim has a method InlinesToPlainText() that could probably be used to strip formatting from the heading text, HtmlFormatter has if (RenderPlainTextInlines.Peek()) condition for the same.

Another test case:

# User Content Heading

# Heading

[foo](#user-content-heading)

@Knagis If we're looking to keep close to Github compatibility with this feature, isn't it easier to just do what they do which is use the Regex [^\w\-\ ] to eliminate undesirable characters?

Your test case predictably renders (simplified a little):

<h1><a id="user-content-user-content-heading" class="anchor"><svg...<path...</path></svg></a>User Content Heading</h1>

<h1><a id="user-content-heading" class="anchor"><svg...<path...</path></svg></a>Heading</h1>

<p><a href="#user-content-heading">foo</a></p>

The only functional change I would make is to link to #user-content-user-content-heading so we don't have to inject or rely on any JS. We may as well leave the anchor class on the link in case people want to spruce it up with CSS.

Forget about regular expressions :) The performance hit from them is rather significant (rewriting the original regexps to what you can see in Scanner.cs resulted in 10x perf improvement).

do what they do which is use the Regex [^\w-\ ]

Is it documented that this is the approach they are using? The link posted by @dmitry-shechtman describes a more advanced algorithm.

Also you have to think about when will the ID values be generated. It feels like it should be after the inline parsing is done so that heading like # [link](/url) gets an ID or link. Of course, I haven't checked what pandoc or github does with headings like this, maybe they output #link_url.

Just doing it during formatting is also not going to work since the link with the #heading might precede the actual heading. And you would have to check if the hash matches a heading ID since it might also match a manual <a id="heading">...</a> in which case you would not add the user-content prefix.

@Knagis I'm getting 404 errors for the links that dmitry gave, for some reason.

For # [link](/url) Github generates a heading ID of user-content-link, so the plaintext version of what's in the heading is what is normalized and turned into the ID.

Actually Github adds the user-content- prefix to any <a id... IDs too, so those IDs would need to be rewritten if this feature was enabled. Maybe call the feature "GithubStyleAnchorIds".

I fixed the links, they should work now.

Actually Github adds the user-content- prefix to any <a id...

I don't see how we will be able to do this... It actually requires an HTML parser which for obvious reasons is not included in CommonMark.NET...

@Knagis OK well maybe just focus on generating the header links then and don't worry about manually specified IDs in a tags or rewriting any fragment links?

It might be the best approach for now - it will allow you to avoid too many hard questions... And also Github pages work with the user-content- prefix as well: https://github.com/Knagis/CommonMark.NET#user-content-usage

@Knagis Why is there an HtmlFormatter and an HtmlFormatterSlim? As far as I can tell, HtmlFormatter isn't being used for anything.

HtmlFormatter is extendable but slower. Slim is the default built for max performance.

Sent from my Sony Xperia™ smartphone

---- Jeremy Morton wrote ----

@Knagis Why is there an HtmlFormatter and an HtmlFormatterSlim? As far as I can tell, HtmlFormatter isn't being used for anything.


Reply to this email directly or view it on GitHub.

@Knagis OK I've experimented and Github is not following the behaviour specified in Pandoc's readme; for example, all-numeric headings have their ID generated as just the number, instead of "section", which is what Pandoc specifies. I think it's much more likely Github is using the simple regular expression given in the code I linked to.

Is performance really that much of a problem here given that it's only an additional feature that's off by default? If so could you give me a better alternative to using that regex?

@Knagis I've created a pull request for my implementation with a regular expression. It works perfectly with the HtmlFormatterSlim class (I commented out the Assert.AreEqual on line 53 in Helpers.cs temporarily on my local machine and the tests work), but I can't figure out how to implement it for the HtmlFormatter class because its RenderPlainTextInlines.Peek() call is inside the WriteInline method, which insists on writing to _target, which can not be changed. This means that it will write out the unformatted text to the main HtmlTextWriter instead of a temporary one I create for this purpose. Any ideas?

@Knagis Does anyone even use HtmlFormatter? All your code does it check its output is the same as HtmlFormatterSlim. You say its performance is worse too, so is there any point in its being in there?

HtmlFormatter is used if someone wants to modify the HTML output (such as adding classes, using different tags etc.) - HtmlFormatterSlim does not give that option. There were relatively many requests for that. The tests checks that its default output matches the specification.

Fine. Guess I'll have to just use my fork of your code to actually get stuff done. I don't know how anyone gets anything checked in with your stupid standards and lack of constructive criticism.