queen-raae/gatsby-remark-oembed

Use without Gatsby

wooorm opened this issue · 11 comments

Hi all! I’m coming here as the maintainer of remark et al, and would like to see a standalone project, so that it can be used directly with remark, but also with, for example, Next or MDSveX.
It seems like there’s only a small layer of Gatsby-specific code here?

For the remark part, there are certain Gatsby-esque things in this project, such as using raw HTML strings: https://github.com/raae/gatsby-remark-oembed/blob/fbca804afd3b1235521db742bb36325c8365ae79/gatsby-remark-oembed/utils/tranformsLinkNodeToOembedNode.js
That works fine if you’re indeed going to make HTML, and Gatsby (and MDX@1) are doing some magic there, but with MDX@2 we’re going to remove it in favor of an AST: hast (rehype) (which the other stuff made on remark already supports).

Is this of interest?

raae commented

Yes, I would love to do that. The plan is to get the oEmbed code into its own module, then a remark module, and then use those as the foundation for plugins for different frameworks (Next, Gatsby, Eleventy, etc).

I will take a closer look at the AST and make sure we comply when we get to this work. Would love to get some review of the code when we get there, or contributions. I have limited time available over the next few months, but I'm hoping to get more people involved as soon as I manage to outline the path forward, so people feel more comfortable contributing.

raae commented

Totally accidental closing of this issue...

raae commented

Am I correct in thinking that it would be better if the node got replaced with something other than the html type?

So if it's an iframe, make the type: element and the tagName: iframe, etc., so that hast-util-embedded would return true?

Correct!
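To make the difference concrete, here is a minimal sketch of the two node shapes. The `looksEmbedded` function is a simplified stand-in I made up for illustration; it is not the real hast-util-embedded, and the tag list in it is an assumption that only covers a few common cases.

```javascript
// An mdast `html` node: the embed is an opaque string that tools cannot inspect.
const rawHtmlNode = {
  type: 'html',
  value: '<iframe src="https://www.youtube.com/embed/K-281doxOMc"></iframe>'
}

// A hast `element` node: the same embed as structured data.
const iframeElement = {
  type: 'element',
  tagName: 'iframe',
  properties: {src: 'https://www.youtube.com/embed/K-281doxOMc'},
  children: []
}

// Simplified, hypothetical stand-in for an "is this embedded content?" check.
// Assumption: only a handful of tag names are listed here.
function looksEmbedded(node) {
  const embeddedTags = ['audio', 'embed', 'iframe', 'img', 'object', 'video']
  return node.type === 'element' && embeddedTags.includes(node.tagName)
}

console.log(looksEmbedded(rawHtmlNode))
console.log(looksEmbedded(iframeElement))
```

A check like this can only say yes for the second shape, because the first one hides the tag name inside a string.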

The typical flow is:

  • parse a document into mdast
  • do things on mdast nodes
  • turn mdast into hast
  • do things on hast nodes
  • turn the hast into whatever: serialized javascript (MDX), React nodes (react-markdown / rehype-react), Vue/Svelte etc, or serialized HTML (rehype-stringify)

In unified, these steps are up to users (see for example this codesandbox). In MDX, Gatsby, and react-markdown, these are done by those projects.

Markdown nodes can be cleanly mapped to HTML nodes, but we can’t easily map strings of HTML to HTML, because:

  • now you need an HTML parser, which is heavy and complex
  • HTML in markdown doesn’t have to make sense: it could be XML or completely broken
  • Strings of HTML don’t make sense to MDX, because it deals with JSX

The alternative I like best, and is supported by all these tools (including Gatsby for years), is to define how something should turn into hast (html nodes) on mdast (markdown nodes). Here is a little example where the mdast nodes are texts, but it should turn into a span with some certain properties in hast: https://codesandbox.io/s/remark-rehype-debug-forked-zs711?file=/src/index.js:874-1169.
Note though: as you’re getting strings of oembed HTML from the server, those need to be parsed somewhere. I like using parse5, and then there’s an adapter to turn it into hast: hast-util-from-parse5.
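As a rough sketch of "define how something should turn into hast, on the mdast node": mdast-util-to-hast reads the `data.hName`, `data.hProperties`, and `data.hChildren` fields. The helper name `attachEmbed`, the `div` wrapper, and the `oembed` class name below are my own assumptions for illustration, and the iframe is written by hand here instead of being parsed from an oEmbed response.

```javascript
// Hypothetical helper: tell mdast-util-to-hast how this mdast node should
// look once it is turned into hast, without storing any HTML strings.
function attachEmbed(mdastNode, hastChildren) {
  mdastNode.data = {
    ...mdastNode.data,
    hName: 'div', // Element name to use in hast (assumed wrapper).
    hProperties: {className: ['oembed']}, // Properties on that element (assumed class name).
    hChildren: hastChildren // hast nodes to use as its children.
  }
  return mdastNode
}

// A paragraph that holds a single link, as mdast.
const paragraph = {
  type: 'paragraph',
  children: [
    {
      type: 'link',
      url: 'https://www.youtube.com/watch?v=K-281doxOMc',
      children: [{type: 'text', value: 'https://www.youtube.com/watch?v=K-281doxOMc'}]
    }
  ]
}

// The embed as hast (hand-written for this sketch).
const iframe = {
  type: 'element',
  tagName: 'iframe',
  properties: {src: 'https://www.youtube.com/embed/K-281doxOMc?feature=oembed'},
  children: []
}

attachEmbed(paragraph, [iframe])
```

The markdown tree keeps its original children, so tools that only care about markdown still see a paragraph with a link.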

A different thing: gatsby-remark-oembed isn’t really about markdown, it’s more about HTML, so using it as a rehype plugin would make more sense in unified/mdx/mdsvex/next (although I don’t think gatsby-transformer-remark supports it).

raae commented

Thank you for all this info! Will try to test some of this out :D I had no md experience when making this, so I am not at all committed to the solution and all input is appreciated.

Cool Benedicte! Let me know if I can be of help!

raae commented

Looking at the first quarter of 2021, I see I do not have much time. I will hopefully start splitting the project up into reusable components, but this work will be postponed until April.

@wooorm I started working on this in a plugin that I think does something similar to what you're describing. I can get to the point that I have a hast for the embed, but I'm not sure what to do with it.

I get an object like this:

    {
      "type": "root",
      "children": [
        {
          "type": "element",
          "tagName": "iframe",
          "properties": {
            "width": 200,
            "height": 150,
            "src": "https://www.youtube.com/embed/K-281doxOMc?feature=oembed",
            "frameBorder": "0",
            "allow": "accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture",
            "allowFullScreen": true
          },
          "children": []
        }
      ],
      "data": {
        "quirksMode": false
      }
    }

For this:

https://www.youtube.com/watch?v=K-281doxOMc

My basic test is this, for now:

const contents = await remark()
  .use(parseMarkdown)
  .use(oembed)
  .process(doc)
console.log(contents)

Would appreciate any guidance, or a code-review.

Few things:

a) this won’t find all URLs in text. If you want links in text, you can use remark-gfm, which turns “pasted” URLs in text into actual links. Then in the AST you can look for those parsed nodes.
(Also: see the next point)

unified()
  .use(remarkGfm)

b) The goal you want to achieve isn’t specific to markdown. Sure, your input currently is, but it would both help to make your solution simpler, and help other users who at some point deal with HTML, to do this work in a rehype/hast/html plugin:

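import {visit, SKIP} from 'unist-util-visit'
import {parseFragment} from 'parse5'
import {fromParse5} from 'hast-util-from-parse5'

// `findProviderAndEndpoint` is your own lookup into the oEmbed provider list.
// The transformer is async, so unified waits for the fetches to finish.
const myRehypePlugin = () => async (tree) => {
  // List of `[paragraph, endpoint, url]`s.
  const pairs = []

  // The visitor itself stays synchronous: it only collects work to do.
  visit(tree, 'element', (element, _, parent) => {
    // A link on its own specifically in a paragraph:
    if (element.tagName === 'a' && parent && parent.type === 'element' && parent.tagName === 'p' && parent.children.length === 1) {
      const url = element.properties.href
      const {endpoint} = findProviderAndEndpoint(url)

      // Known endpoint.
      if (endpoint) {
        pairs.push([parent, endpoint, url])
        return SKIP
      }
    }
  })

  // After the walk, do the async work for everything that was collected.
  await Promise.all(pairs.map(async ([element, endpoint, url]) => {
    const response = await fetch(`${endpoint.url.replace(/\{format\}/g, 'json')}?url=${encodeURIComponent(url)}`)
    const oembed = await response.json()

    if (oembed && oembed.html) {
      // `fromParse5` returns a root node: use its children.
      element.children = fromParse5(parseFragment(oembed.html)).children
    }
  }))
}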

c) The act of storing hast as a subtree into mdast, has to do with how mdast is turned into hast. Hence it’s described in mdast-util-to-hast.
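One detail from the plugin sketch under b): the URL of the page to embed goes into a query string, so it has to be percent-encoded. Here is a small, dependency-free sketch of building that request URL with the standard `URL` class. The helper name `buildOembedUrl` is hypothetical, and the endpoint in the usage line is only there as an illustration; real endpoints would come from your provider list.

```javascript
// Hypothetical helper: build an oEmbed request URL for a given endpoint.
function buildOembedUrl(endpointUrl, pageUrl) {
  // Some provider endpoints contain a `{format}` placeholder.
  const requestUrl = new URL(endpointUrl.replace(/\{format\}/g, 'json'))

  // `searchParams.set` percent-encodes the value, so `?`, `&`, and `=`
  // inside `pageUrl` cannot leak into the outer query string.
  requestUrl.searchParams.set('url', pageUrl)
  requestUrl.searchParams.set('format', 'json')

  return requestUrl.href
}

console.log(
  buildOembedUrl(
    'https://www.youtube.com/oembed',
    'https://www.youtube.com/watch?v=K-281doxOMc'
  )
)
```

Reading the `url` parameter back out of the result gives the original page URL unchanged, which is not guaranteed with plain string concatenation.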

Thanks for taking a look. This is structured a bit differently, and looks like it might work better. Using visit instead of selectAll makes it easier to replace element.children. Makes sense! I actually started with visit, following some howto, and scrapped it because it wasn't working great with the async stuff I was doing. I will need to double-check that visit actually finishes there, before returning.

a) this won’t find all URLs in text.
b) The goal you want to achieve isn’t specific to markdown.

I am looking for any URL on a line by itself. First I get the paragraph text, then I check each one for URL validity. I don't want to match just any URL or link, or links in text (to follow the behavior of this plugin, with prefixes turned off). For example, this should not be replaced with an embed:

[check out this tweet](https://twitter.com/Chhapiness/status/1422326068917284869)

I think it's easy enough to just check links that have the same inner-text as href, though.

c) The act of storing hast as a subtree into mdast, has to do with how mdast is turned into hast.

Good advice about sticking with HTML/hast. I have never made a rehype/hast/html plugin and can't seem to get a clear idea of what current best (or common) practices are, but it sounds fine to operate in HAST-space, if that is typical or more widely useful.

Thanks again!

I will need to double-check that visit actually finishes there, before returning.

Indeed, the visitor passed to visit can’t be async (or well, it can be, but that results in a race condition). You can either save the results from one walk in a list, and after walking, then do the async stuff on them (like I did in my example), or save a list of promises, and await them after the walk.
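My example above uses the first approach. A minimal sketch of the second one (save promises, await them after the walk) could look like this. To keep it free of dependencies, `walk` is a tiny hand-written stand-in for unist-util-visit, and `annotateLinks` and `lookup` are hypothetical names; `lookup` stands for whatever async work you do per link, such as an oEmbed request.

```javascript
// Tiny synchronous tree walker: a stand-in for unist-util-visit.
function walk(node, visitor) {
  visitor(node)
  for (const child of node.children || []) {
    walk(child, visitor)
  }
}

// `lookup` is a hypothetical async function that takes a URL.
async function annotateLinks(tree, lookup) {
  const promises = []

  walk(tree, (node) => {
    if (node.type === 'element' && node.tagName === 'a') {
      // Start the async work, but do not await it inside the visitor.
      promises.push(
        lookup(node.properties.href).then((result) => {
          node.data = {...node.data, oembed: result}
        })
      )
    }
  })

  // The walk is done; now wait until every lookup has settled.
  await Promise.all(promises)
  return tree
}
```

The `then` callbacks only run after the synchronous walk has finished, so the tree is not modified while it is being traversed.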

I think it's easy enough to just check links that have the same inner-text as href, though.

Your code would also match these three URLs: https://a.com**https://b.com**https://c.com.
That’s why I’d instead suggest looking for a single link in a paragraph.
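Both checks can be combined. Here is a sketch on plain hast-shaped objects, with no dependencies; the function names are made up for this example, and the trees you would test against are hand-built approximations, not actual parser output.

```javascript
// Collect the text inside a node.
function textOf(node) {
  if (node.type === 'text') return node.value
  return (node.children || []).map(textOf).join('')
}

// Looser check: a link whose text is the same as its `href`.
function isBareLink(node) {
  return (
    node.type === 'element' &&
    node.tagName === 'a' &&
    textOf(node) === node.properties.href
  )
}

// Stricter check: a paragraph whose only child is such a link.
function isLoneLinkParagraph(node) {
  return (
    node.type === 'element' &&
    node.tagName === 'p' &&
    node.children.length === 1 &&
    isBareLink(node.children[0])
  )
}
```

With only `isBareLink`, each of the three links in the paragraph above would become an embed. With `isLoneLinkParagraph`, that paragraph is left alone, and so is a labelled link such as the "check out this tweet" example.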

and can't seem to get a clear idea of what current best (or common) practices are, but it sounds fine to operate in HAST-space, if that is typical or more widely useful.

Just to make sure it’s clear: my example is a rehype plugin.

If your code cares about HTML things and semantics, do it in rehype.
If you care about markdown, such as having to differentiate between different link nodes (reference, resource, etc), do it in remark!

👍👋