syntax-tree/mdast-util-toc

html value being turned into string

d4rekanguok opened this issue · 4 comments

Hello!
I'm running into an issue with mdast-util-toc flattening an html node's value into a string:

import u from 'unist-builder'
import mdastToToc from 'mdast-util-toc'

const mdast = u('root', [
  u('heading', { depth: 1 }, [
    u('text', { value: 'Hello' }),
    u('html', { value: '<code>World<code>' })
  ])
])

const toc = mdastToToc(mdast).map
console.dir(toc, { depth: null })

This yields the following result:

{ type: 'list',
  ...
    [ { type: 'link',
         title: null,
         url: '#hellocodeworldcode',
         children: [ { type: 'text', value: 'Hello<code>World<code>' } ] 
    }]
  ...
}

Ideally, I would like to receive something like this:

    [ { type: 'link',
         title: null,
         url: '#helloworld',
         children: [
            { type: 'text', value: 'Hello' },
            { type: 'html', value: '<code>World<code>' },
         ] 
    }]

I found this issue when working on gatsbyjs/gatsby#13608. After a table of contents is extracted, it is transformed into a hast tree via mdast-util-to-hast and finally into HTML via hast-util-to-html. There are plugins in the ecosystem that might modify markdown headings and inject HTML into them, which leaves HTML artifacts in the final output:

<ul>
  <li>
    <a href="/demo/#generating-tocs-with-code-classlanguage-textgatsby-transformer-remarkcode\">
    Generating TOCs with &#x3C;code class="language-text">gatsby-transformer-remark&#x3C;/code>
    </a>
  </li>
</ul>

Would mdast-util-toc be open to supporting this behavior?

Hi @d4rekanguok 👋 thanks for the issue!

I've had a look through this issue and the related issue. I think what you're suggesting above is possible; I'm just unsure what the impact will be on others. I guess we could make it a semver major change. I'll wait for some others to comment.

I have a few questions for the gatsby side of things, so I'll ask them on the related issue.

I think for us here the problem isn’t just about HTML: it’s that we ignore anything that isn’t plain text (here).

If we’re going the semver major route anyway, we should consider allowing other values as well, like emphasis and the like.

Note that headings can contain phrasing content, but as the TOC entries are meant to be wrapped in a link, we could only clone static phrasing content and should thus ignore links and link references.
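To make the idea concrete, here is a rough sketch of cloning a heading's children for use inside a TOC link. The helper name `cloneStaticPhrasing` is hypothetical and not part of mdast-util-toc. It also assumes that "ignoring" a link means keeping its content and dropping the wrapper, which is one possible reading of the comment above.

```javascript
// Hypothetical helper, not part of mdast-util-toc.
// Node types that cannot be nested inside the TOC entry's own link.
const UNWRAP = new Set(['link', 'linkReference'])

function cloneStaticPhrasing(nodes) {
  const result = []
  for (const node of nodes) {
    if (UNWRAP.has(node.type)) {
      // Assumption: keep the link's content, drop the link itself.
      result.push(...cloneStaticPhrasing(node.children || []))
    } else if (Array.isArray(node.children)) {
      // Parent node such as emphasis or strong: copy it and recurse.
      result.push({ ...node, children: cloneStaticPhrasing(node.children) })
    } else {
      // Literal node such as text, inlineCode, or html: shallow copy.
      result.push({ ...node })
    }
  }
  return result
}
```

With this, a heading containing text, a link wrapping emphasis, and an html node would produce a TOC entry with the text, the emphasis, and the html node, but no nested link.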

One (big?) caveat though is that we cannot create the same slugs as GH if HTML is embedded :'(

> One (big?) caveat though is that we cannot create the same slugs as GH if HTML is embedded :'(

This seems like a deal breaker. Could we potentially store a flat string value and generate a slug for each heading based on that? I really like the possibility of allowing static phrasing content like @wooorm mentioned.
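Something like the following is what I have in mind, as a sketch only. `toFlatString` and `naiveSlug` are hypothetical names, the tag stripping is a naive regex, and the slug function is a simplified stand-in that is not guaranteed to match GitHub's slugs.

```javascript
// Hypothetical helpers, not part of mdast-util-toc.

// Flatten heading children into one string, dropping tag markup from html nodes.
function toFlatString(nodes) {
  return nodes
    .map((node) => {
      // Naive tag removal; assumes html values hold simple inline markup.
      if (node.type === 'html') return node.value.replace(/<[^>]*>/g, '')
      if (Array.isArray(node.children)) return toFlatString(node.children)
      return typeof node.value === 'string' ? node.value : ''
    })
    .join('')
}

// Simplified slug: lowercase, drop punctuation, turn whitespace into hyphens.
// This is an approximation, not GitHub's actual algorithm.
function naiveSlug(value) {
  return value
    .toLowerCase()
    .trim()
    .replace(/[^\p{L}\p{N}\s-]/gu, '')
    .replace(/\s+/g, '-')
}

const headingChildren = [
  { type: 'text', value: 'Hello' },
  { type: 'html', value: '<code>World<code>' }
]

const flat = toFlatString(headingChildren) // 'HelloWorld'
const url = '#' + naiveSlug(flat) // '#helloworld'
```

The link's `url` would then come from the flat string, while its `children` could still keep the richer nodes.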

In the meantime, I think it'd still be an improvement if there were a way to remove embedded HTML tags (one that doesn't involve blowing the bundle up to a blasphemous size like I did, sorry!)
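As a stopgap on the consumer side, a small dependency-free pass over the generated TOC could strip tags from the flattened text values. This is only a sketch: `stripTagsInToc` is a hypothetical name, the regex is naive, and it assumes the tree shape shown in the result above (text nodes nested under links).

```javascript
// Hypothetical post-processing step, not part of mdast-util-toc.
// Returns a copy of the TOC tree with tag markup removed from text values.
function stripTagsInToc(node) {
  if (node.type === 'text' && typeof node.value === 'string') {
    // Naive tag removal; good enough for simple inline tags only.
    return { ...node, value: node.value.replace(/<[^>]*>/g, '') }
  }
  if (Array.isArray(node.children)) {
    return { ...node, children: node.children.map(stripTagsInToc) }
  }
  return node
}

// A TOC shaped like the output shown earlier in this issue.
const toc = {
  type: 'list',
  children: [
    {
      type: 'listItem',
      children: [
        {
          type: 'paragraph',
          children: [
            {
              type: 'link',
              title: null,
              url: '#hellocodeworldcode',
              children: [{ type: 'text', value: 'Hello<code>World<code>' }]
            }
          ]
        }
      ]
    }
  ]
}

const cleaned = stripTagsInToc(toc)
```

Note that this only cleans the visible text. The `url` is left untouched, so the slug would still need to be handled separately.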