Inconsistent slugs with unicode (emoji) characters
drizzer14 opened this issue · 4 comments
Initial checklist
- I read the support docs
- I read the contributing guide
- I agree to follow the code of conduct
- I searched issues and couldn’t find anything (or linked relevant results below)
Problem
In my project, I have unicode characters (emoji) in the headings. Let it be "🏃♂️ Heading".
When the TOC is generated, the output URL slugs sometimes contain those emojis, although according to the github-slugger docs this should not be the case.
I have noticed that mdast-util-toc, as of time of me writing this issue, contains version 1.0.0 of github-slugger in its package.json, which was released way back on September 22nd 2015. Since then, the emoji standard has evolved quite drastically, and some newer emojis are not stripped during slug creation. Thus, the slug for 🏃♂️ Heading becomes #%EF%B8%8F-heading, while, e.g., 🏷 Another Heading has its emoji stripped correctly: #-another-heading.
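To illustrate where a slug like #%EF%B8%8F-heading can come from, here is a minimal sketch. It assumes the heading uses the standard "man running" sequence (U+1F3C3, U+200D, U+2642, U+FE0F) and uses a made-up, deliberately incomplete pattern called naiveEmojiPattern. This is not github-slugger's actual regex, only a stand-in for any emoji matcher that misses the variation selector U+FE0F.

```javascript
// Hypothetical, deliberately incomplete emoji stripper (NOT github-slugger's real regex):
// it knows the runner, the zero-width joiner, and the male sign, but not U+FE0F.
const naiveEmojiPattern = /[\u{1F3C3}\u200D\u2642]/gu;

function naiveSlug(heading) {
  return heading
    .replace(naiveEmojiPattern, "") // strip only the code points the pattern knows
    .toLowerCase()
    .replace(/ /g, "-"); // spaces become hyphens
}

// The heading written with escapes so every code point is visible.
const heading = "\u{1F3C3}\u200D\u2642\uFE0F Heading";

// List the code points that make up the emoji part of the heading.
const codePoints = Array.from(
  heading.slice(0, heading.indexOf(" ")),
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);

const slug = naiveSlug(heading);
const encoded = encodeURIComponent(slug);

console.log(codePoints.join(" ")); // U+1F3C3 U+200D U+2642 U+FE0F
console.log("#" + encoded); // #%EF%B8%8F-heading
```

The leftover U+FE0F is invisible in most renderers, which is why the problem only shows up once the slug is percent-encoded.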
Solution
While searching github-slugger issues, I have found this particular one, which suggests that their emoji detection algorithm was at least outdated (or even broken).
As seen in their latest update 1.4.0, they now include the generated regex from emoji-regex in their source code, which is kept up-to-date automatically.
The solution I propose is to keep the github-slugger dependency up-to-date and bump its version to 1.4.0 in the package.json, which should solve the outdated emoji detection problem.
Maybe it's also worth including some of the newer emojis in the tests to verify it's still working correctly. The only emoji present in the unicode test (❤️) was also broken in github-slugger at some point in time and later fixed in their 1.1.2 release.
Alternatives
Alternatively, it may be cool to include a config option to transform/map the URL slug on the fly, so that it can be modified before it lands in the parsed AST: either a mapSlug function ((slug: string) => string) or regexp-based stripping, like stripSlug: RegExp, in the search.js function.
But I'd rather just bump the github-slugger version and verify the absence of regressions in tests, as another config option would be an overhead, IMO.
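For reference, the proposed alternative could look roughly like the sketch below. Both mapSlug and stripSlug are the hypothetical options named above; neither exists in mdast-util-toc, and transformSlug is an invented helper name used only for illustration.

```javascript
// Hypothetical post-processing step; neither option exists in mdast-util-toc.
function transformSlug(slug, options = {}) {
  const { stripSlug, mapSlug } = options;
  let result = slug;
  // Remove everything matched by the optional regular expression.
  if (stripSlug instanceof RegExp) result = result.replace(stripSlug, "");
  // Then let the optional callback rewrite the slug.
  if (typeof mapSlug === "function") result = mapSlug(result);
  return result;
}

// Strip a leftover variation selector (U+FE0F) from a slug.
const stripped = transformSlug("\uFE0F-heading", { stripSlug: /\uFE0F/g });

// Drop the leading hyphen that remains after an emoji was removed.
const mapped = transformSlug("-another-heading", {
  mapSlug: (slug) => slug.replace(/^-+/, ""),
});

console.log(stripped); // -heading
console.log(mapped); // another-heading
```

A caveat with this design: any link generated elsewhere (for example the heading ids themselves) would need the same transformation, otherwise the TOC entries would point to anchors that do not exist.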
Heya!
> mdast-util-toc, as of time of me writing this issue, contains version 1.0.0 of github-slugger in its package.json
This statement is incorrect. This package uses ^1.0.0 (note the ^). That means that all the work done on github-slugger over the last 5 years is pulled in already.
For more information, see how semver works here: https://semver.npmjs.com. You can input github-slugger and ^1.0.0 there and see that all 1.x versions are pulled in.
If you have an older github-slugger in your node_modules, you can run npm update to update.
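As a rough illustration of what the caret means, here is a simplified sketch of caret matching. The function satisfiesCaret is an invented name, not npm's implementation; it assumes plain x.y.z versions with a major version of 1 or higher and ignores prerelease tags (caret ranges behave differently for 0.x versions).

```javascript
// Split "1.4.0" into [1, 4, 0].
function parseVersion(version) {
  return version.split(".").map(Number);
}

// Simplified caret check: same major version, and not lower than the range's base.
function satisfiesCaret(version, range) {
  if (!range.startsWith("^")) throw new Error("expected a caret range such as ^1.0.0");
  const [baseMajor, baseMinor, basePatch] = parseVersion(range.slice(1));
  if (baseMajor < 1) throw new Error("this sketch only handles major versions >= 1");

  const [major, minor, patch] = parseVersion(version);
  if (major !== baseMajor) return false; // a new major version is never matched
  if (minor !== baseMinor) return minor > baseMinor;
  return patch >= basePatch;
}

console.log(satisfiesCaret("1.0.0", "^1.0.0")); // true
console.log(satisfiesCaret("1.1.2", "^1.0.0")); // true
console.log(satisfiesCaret("1.4.0", "^1.0.0")); // true
console.log(satisfiesCaret("2.0.0", "^1.0.0")); // false
```

So a package.json entry of ^1.0.0 allows 1.4.0 to be installed without any change to the dependent package.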
As seen in their latest update 1.4.0, they now include the generated regex from emoji-regex in their source code, which is kept up-to-date automatically.
Not completely: emoji-regex is not what GitHub uses. Instead, it was removed in 1.4.0: Flet/github-slugger@af59f34#diff-e727e4bdf3657fd1d798edcd6b099d6e092f8573cba266154583a746bba0f346R1.
Can you describe more of the problem you’re experiencing? What isn’t working?
@wooorm thanks for the quick reply!
I jumped ahead and didn't look into the real culprit here: React Markdown 🤦♂️. It looks like somewhere inside it does not parse the heading correctly and outputs a slug with composite emojis included in it, which is not what mdast-util-toc does in isolation.
Sorry for not diving deep enough into the problem and thank you for the great tool!
I suppose this issue can be closed, gonna fight with React Markdown from now on 😄.
👍
Are you on the latest react-markdown? It should all just work. It could definitely be a bug somewhere, but I’m not sure what the root problem is!