rehypejs/rehype-minify

`rehype-remove-comments`: ability to preserve specific comments

didnotwant opened this issue · 16 comments

Initial checklist

Problem

On one of my projects I use nginx SSI Commands to include some company-wide HTML content to one of my pages – the injected HTML just replaces a specific set of HTML comments.
Currently the rehype-remove-comments only allows to set removeConditional flag so that the IE comments can be kept.
I imagine that my use-case could be not the only one in the world – that's why I decided to file this issue.

Solution

My proposal is to add an option called exclude where one can pass an array of regular expressions matching specific content of HTML comments.

Example usage in .rehyperc:

export default {
  plugins: [
    ['./build/rehype-remove-html-comments.mjs', {
      exclude: [
        // We want to preserve HTML comments containing SSI includes commands
        // handled by nginx. See: https://nginx.org/en/docs/http/ngx_http_ssi_module.html
        /# (block|endblock|include)/,
        // … some other use-cases if needed
      ],
    }],
  ],
}

But I'm not sure if such way of configuring it is acceptable in all possible usages of the plugin (I think that currently there's just a flag for a specific reason, e.g. it's easier to use the tools in CLI?).

I created a custom plugin for my repo (I believe as a temporary solution, if my idea is good enough to the maintainers of rehype-minify) which is mostly the same as the current rehype-remove-comments plugin, so let me share the code I prepared:

import {visit} from 'unist-util-visit';

export default function rehypeRemoveHtmlComments(options = {}) {
  const shouldPreserve = (node) => {
    if (!('exclude' in options)) {
      return false;
    }

    if (!Array.isArray(options.exclude)) {
      throw new Error('The `exclude` option requires passing an array.');
    }

    if (options.exclude.some((maybeRegex) => !(maybeRegex instanceof RegExp))) {
      throw new Error('Elements of `exclude` have to be regular expressions.');
    }

    return options.exclude.some((regex) => regex.test(node.value));
  };

  return function (tree) {
    visit(tree, 'comment', function (node, index, parent) {
      if (
        typeof index === 'number' &&
        parent &&
        !shouldPreserve(node)
      ) {
        parent.children.splice(index, 1)
        return index;
      }
    })
  }
}

Alternatives

I wasn't able to find any alternative plugins that could allow doing exactly what I expect in the description of the Problem.
If there are some and anyone knows them, please share with me!

I like it, but I think I would call it preserve and make it accept a regular expression or a function that returns a boolean. This supports more complex use cases.

rehype()
  .use(rehypeRemoveComments, {
    preserve: /# (block|endblock|include)/
  })

rehype()
  .use(rehypeRemoveComments, {
    preserve(content) {
      if (content.includes('# block')) {
        return true
      }

      if (content.includes('# endblock')) {
        return true
      }

      if (content.includes('# include')) {
        return true
      }

      return false
    }
  })

Taking a step back, is an option needed at all?
It sounds like the only comments would be NGINX commands.
So all comments could be kept.
By turning this part of the minifier off.

I'm not entirely opposed to adding an option, but it does feel very niche to me.

Even the IE directive pass through could and potentially should be removed in the next major.
IE has dropped below 0.5% of global browser share and keeps dropping.

preserve

Why not just go with test(node: Comment) / test(value: string)?

Even the IE directive pass through could and potentially should be removed in the next major.

👍


Taking a step back, is an option needed at all?
It sounds like the only comments would be NGINX commands.
So all comments could be kept.

I’m 👍 on this Q. Are there comments that you want to drop? Is turning off this plugin to keep all comments viable?

@remcohaszing – good point, I already had the idea on just a regex instead of the array of them, but it seemed to me not too flexible if I ever wanted to preserve more cases of comments. The function is even better in terms of the readability!


@ChristianMurphy, @wooorm – your approach on permanently disabling the HTML comments trimming is what I exactly had done before I decided to have the option I proposed in this issue, so let me add more context to this.

The rehype is used in my repository with statically generated pages (using Astro), where the pages could be in different formats, depending on what's needed for a specific page: .astro, .mdx or even plain .html. The Astro and MDX (as I guess) allow to add comments that won't be rendered in production, the HTML – doesn't. In all of them it's possible to use the HTML comments, so there's a chance that something would leak to prod if the HTML comments aren't removed.

I could use only the non-HTML comments, but I'm not the only member of the team, and we don't want to have to remember about this rule each time we change something in the repo, and also I think it would be hard to establish some linting rule (or a rule for astro check) which ensures that we don't write HTML comments on pages (with the exception example I mentioned: nginx commands).

What do you guys think?

Why not just go with test(node: Comment) / test(value: string)?

Both the names test / preserve are equally fine with me. I think test is used more often within the unified ecosystem, so that makes sense.

Also passing either the Comment node or the string value is fine with me. Passing a RegExp would be nice, but this could also be implemented by the user in the test function.

Taking a step back, is an option needed at all?

IMO it makes sense. HTML comments are part of the DOM and can have actual meaning. IIRC AngularJS used them?

Terser also has this option (comments). A use case in their documentation is to preserve legal comments.

I think test is used more often within the unified ecosystem, so that makes sense.

Indeed.

Passing a RegExp would be nice, but this could also be implemented by the user in the test function.

True. And, preferably we’d support a string too. So that JSON config files are supported.

IIRC AngularJS used them?

How?
Isn’t how Angular uses it the same as how React uses them. Which is exactly why humans can‘t render comments in React, but the framework does?

Terser also has this option (comments). A use case in their documentation is to preserve legal comments.

A valid use case, but it seems to apply to JS/CSS/other code — I never see such copyright info in HTML comments. HTML has other ways to show that (<meta>, <link>?). I also don’t see it in other markup/content languages (markdown, svg, xml?)


it's possible to use the HTML comments, so there's a chance that something would leak to prod if the HTML comments aren't removed.

If people can write HTML comments and leak things, they could leak things in any other way: they could leak outside of comments.
It seems like security is your main concern, but I’m not sure security is helped much by this? 🤔
The main reason this exists is for performance instead: so that the HTML sent over the wire is smaller.

which ensures that we don't write HTML comments on pages (with the exception example I mentioned: nginx commands).

Wouldn’t it make sense to use rehype to minify and remove all comments after nginx commands?
Or, to disallow them, as there is already a good alternative: if people want to do such things, they could use Astro / MDX?
rehype-minify is a minifier, it makes things ugly, I’d expect it to interfere with nginx.


I’m leaning on 👍 for such a feature, but I do wonder about the story for this use case.

Passing a RegExp would be nice, but this could also be implemented by the user in the test function.

True. And, preferably we’d support a string too. So that JSON config files are supported.

Personally I lean against accepting a string until this is something a user actually asks for. They can use JavaScript config files, but I think typically this is used programmatically.

IIRC AngularJS used them?

How? Isn’t how Angular uses it the same as how React uses them. Which is exactly why humans can‘t render comments in React, but the framework does?

Yeah, I think something like that. It has been a while.

Terser also has this option (comments). A use case in their documentation is to preserve legal comments.

A valid use case, but it seems to apply to JS/CSS/other code — I never see such copyright info in HTML comments. HTML has other ways to show that (<meta>, <link>?). I also don’t see it in other markup/content languages (markdown, svg, xml?)

I’ve seen such comments while viewing the HTML source of miscellaneous websites over the course of years. I don’t recall which websites exactly, but I know it happens.

Personally I lean against accepting a string until this is something a user actually asks for. They can use JavaScript config files, but I think typically this is used programmatically.

I think this is generally fair—to wait for users to request things—however, all current plugins (as far as I am aware) support the non-JS config so that users don’t have to use the JS config file.

I’ve seen such comments while viewing the HTML source of miscellaneous websites over the course of years. I don’t recall which websites exactly, but I know it happens.

This seems a bit iffy to me 🤔 Humans will see the content, so if there’s copyright on that content, you’d want that available to those humans too. Not just to programmars who view source?


@didnotwant could you comment on my questions to you in #50 (comment)? About security and nginx commands? Thanks!

Humans will see the content, so if there’s copyright on that content, you’d want that available to those humans too. Not just to programmars who view source?

I’m not judging, only observing. I wouldn’t choose for this either. The exact same thing can be said about JavaScript minifiers preserving such comments.

Hi! Thanks for taking the time to contribute! This has been marked by a maintainer as needing more info. It’s not clear yet whether this is an issue. Here are a couple tips:

  • Spend time framing the issue! The more time you put into it, the more we will
  • Often, maintainers respond with why for several back and forths; rubber duck debugging might help avoid that
  • Folks posting issues sometimes fall for xy problems: asking for a certain solution instead of raising the root problem

Thanks,
— bb

The exact same thing can be said about JavaScript minifiers preserving such comments.

My point is that it’s completely different: js/css minifiers have this option because OSS licenses (MIT etc) say that you must include the license in the code. So you import js/css from node_modules and get that comment in your bundle and the bundler needs to keep it. There is a special comment syntax for “license” comments so those bundlers know whether to keep it or not.
Markup as found in HTML does not have that. It doesn’t exist in node_modules. There is no MIT license on the tags. You can’t import some HTML into other HTML.
No, the words have licenses, creative commons and such. And thus you need to show that copyright + license visually next to the words.

I’m not judging, only observing.

I would love to see it because it seems so weird to me. 😅

@wooorm – thanks for the mention, I forgot to answer. Anyway:

If people can write HTML comments and leak things, they could leak things in any other way: they could leak outside of comments.
It seems like security is your main concern, but I’m not sure security is helped much by this? 🤔
The main reason this exists is for performance instead: so that the HTML sent over the wire is smaller.

Nope, security isn't the main thing here (primarily we use rehype for minification, though), but to me, removing all the unwanted comments (in HTML, but also in JS, or CSS) is not only about the performance, but also cleaning up possible comments containing e.g. business domain's knowledge – and sometimes particular types of comments should stay on prod, e.g. in minified JS there are the comments about author or licenses. But you're right – at the end of the day we (as developers) are the ones responsible for what's pushed to prod, so it's not my main reason as you wrote, but rather a nice-to-have tool which could ensure us that we don't have to remember about possible leaks from comments.
So it'd be great to keep rehype-minify as a helpful utility, but with ability to preserve some comments on demand, which could be useful not only for our case.

Wouldn’t it make sense to use rehype to minify and remove all comments after nginx commands?
Or, to disallow them, as there is already a good alternative: if people want to do such things, they could use Astro / MDX?
rehype-minify is a minifier, it makes things ugly, I’d expect it to interfere with nginx.

Minification happens during the build of the "dist", and the nginx commands are used to inject some, let's say organisation-wide, HTML fragments (replacing the HTML comments which I'd like to preserve) before serving the already built page to the client.
Of course, some HTML fragments could be Astro components instead of SSI includes, but not each and every of them, or at least not at the very moment (different repositories in different technologies, not-yet migrated stuff etc.).

OK, I think a PR for test(value: string) is acceptable!

Hi! This was marked as ready to be worked on! Note that while this is ready to be worked on, nothing is said about priority: it may take a while for this to be solved.

Is this something you can and want to work on?

Team: please use the area/* (to describe the scope of the change), platform/* (if this is related to a specific one), and semver/* and type/* labels to annotate this. If this is first-timers friendly, add good first issue and if this could use help, add help wanted.

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.