whatwg/html-build

Add build-time syntax highlighting

domenic opened this issue · 12 comments

In whatwg/html#2751 @sideshowbarker proposes adding client-side HTML syntax highlighting. We may want to merge that sooner instead of blocking on what I propose below. But the below proposal avoids some of the issues there and has some other benefits, so we should do it eventually.

The proposal is to have @tabatkins extract his syntax highlighter from Bikeshed and then the html-build process and/or wattsi can shell out to it. The exact shape of this is TBD, see below.

Bikeshed's syntax highlighter consists of:

  • Pygments as the base
  • Support for highlighting even code that has interspersed markup, which we use a decent amount in HTML---such as <mark>, <ins>, <del>, or <a>
  • Web IDL syntax highlighting, as that is not a Pygments-supported language
  • Line numbering/highlighting (not relevant to us)

The benefits of this over the client-side solution are:

  • No potential startup jank for users
  • Consistency with other WHATWG specs (which use Bikeshed directly)
  • Allows interspersed markup as described above
  • Web IDL syntax highlighting

Also, I think we'd want to have this easily disabled during the build process, to get faster local builds. For deploys/in CI we would enable it of course.

This would probably all work best if we can shell out to a script extracted from Bikeshed. It would presumably be written in Python, Bikeshed/Pygments's language. There are a few possibilities for the overall workflow:

  1. Preprocess the spec before feeding it to wattsi; the syntax highlighter is responsible for finding all code blocks
    • Probably won't work: Wattsi input source is not real HTML
  2. Postprocess each page of the spec after building it; the syntax highlighter is responsible for finding all code blocks
    • Probably will work, although a second pass might be slow
    • Might be more work for @tabatkins
  3. Shell out each code fragment to be highlighted to the syntax highlighter tool
    • Would require Wattsi integration, not html-build integration
    • Would require a format for passing the data; @tabatkins prefers a [tagname, {attrs}, ...contents]-style tree instead of HTML, I believe so that he then doesn't have to include an HTML parser

After writing this, I am leaning toward (2) right now, although that didn't align with @tabatkins's thoughts in IRC (he was thinking more along the lines of (3)), so I am curious what the right approach is.

Yeah, the problem with option 2 is that I'll need to parse the HTML to extract the code blocks; that means including LXML and html5lib in the bundle dependencies, making the whole thing more complicated. Parsing HTML is also by far the slowest part of Bikeshed; I've avoided as much of it as possible outside the initial whole-document parse (which is still ~1/3 of Bikeshed's entire runtime on typical specs).

If Wattsi already has a DOM available to it, option 3 is far simpler and faster.

Alternately, option 2 paired with your own HTML parser would be fine too - run over it and extract what you need, convert to JSON, pipe to my library, then fill the result back in yourself.

This is friendlier to adding more future external tools, too, if you really don't want to extend Wattsi itself.
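To make the "convert to JSON, pipe to my library, fill the result back in" flow concrete, here is a rough sketch of what the filter side could look like. It is not Bikeshed's code: the function names, the `kw` class, and the tiny keyword set are all hypothetical stand-ins for the real Pygments-based highlighter, and it assumes the [tagname, {attrs}, ...contents] tree shape mentioned above.

```python
import json
import re

# Hypothetical stand-in for the real highlighter: marks a few Web IDL keywords.
KEYWORDS = {"interface", "attribute", "readonly"}
TOKEN = re.compile(r"\w+|\W+")


def highlight_text(text):
    """Split one text node into plain strings and ["span", {...}, word] nodes."""
    out = []
    for token in TOKEN.findall(text):
        if token in KEYWORDS:
            out.append(["span", {"class": "kw"}, token])
        elif out and isinstance(out[-1], str):
            out[-1] += token  # merge adjacent plain text
        else:
            out.append(token)
    return out


def highlight_node(node):
    """Highlight a [tagname, {attrs}, ...contents] node, keeping child elements."""
    tag, attrs, *contents = node
    new_contents = []
    for item in contents:
        if isinstance(item, str):
            new_contents.extend(highlight_text(item))
        else:
            new_contents.append(highlight_node(item))
    return [tag, attrs, *new_contents]


def run_filter(json_text):
    """JSON tree in, highlighted JSON tree out (what a stdin/stdout tool would do)."""
    return json.dumps(highlight_node(json.loads(json_text)))
```

The caller would write one tree to the tool's stdin and read the highlighted tree back from stdout; interspersed markup such as <dfn> or <a> passes through untouched because only text nodes are rewritten.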

Yeah, the problem with option 2 is that I'll need to parse the HTML to extract the code blocks; that means including LXML and html5lib in the bundle dependencies, making the whole thing more complicated.

Yeah that sounds undesirable if it can be avoided

Parsing HTML is also by far the slowest part of Bikeshed

Yeah if we add Python-based parsing of the HTML of the spec to the build, that seems like it’s going to result in a much longer build time than what we have now.

Right now on my machine with just the wattsi-based build we currently have, it takes me 11 seconds to build all the output for the spec—including all the multipage processing and other output features.

The wattsi HTML parser is extremely fast. I think it might be the fastest conforming HTML parser available anywhere.

If Wattsi already has a DOM available to it, option 3 is far simpler and faster.

I don’t know that what we have with wattsi is exposed in a way that would allow integration with the syntax-highlight processing. I suppose it’s possible but don’t have a clear idea myself at this point of how we could actually do it

Alternately, option 2 paired with your own HTML parser would be fine too - run over it and extract what you need, convert to JSON, pipe to my library, then fill the result back in yourself.

That sounds promising but if we were to do that using the HTML parser from the wattsi sources, we’d need somebody to write the (FreePascal) code for an application that does what we need.

I don’t know that what we have with wattsi is exposed in a way that would allow integration with the syntax-highlight processing. I suppose it’s possible but don’t have a clear idea myself at this point of how we could actually do it

I assume ObjectPascal can shell out - if so, you can put in some code that looks for pres (or hooks into existing code that does), then do the convert to JSON->shell to my code->convert back to HTML dance.

The wattsi HTML parser is extremely fast. I think it might be the fastest conforming HTML parser available anywhere.

I wonder if I could do a reasonable conversion. I've already got a patch that almost completely removes selector usage from Bikeshed (as it's another big component of the processing time), so it wouldn't be too hard to swap over to a new parser/treelib that doesn't have Selectors support.

I assume ObjectPascal can shell out - if so, you can put in some code that looks for pres (or hooks into existing code that does), then do the convert to JSON->shell to my code->convert back to HTML dance.

Yeah that sounds feasible. I’ve never done that in Pascal but then pretty much every wattsi change I’ve worked on has involved needing to learn how to do some thing I hadn’t needed to do yet in any previous changes.

I wonder if I could do a reasonable conversion. I've already got a patch that almost completely removes selector usage from Bikeshed (as it's another big component of the processing time), so it wouldn't be too hard to swap over to a new parser/treelib that doesn't have Selectors support.

Not sure what you mean by that part… It sounds like something different than what you wrote earlier about just shelling out to your code. It sounds like what you mean is sort of the opposite of the shelling-out-from-wattsi-to-python you described in your other comment—that is instead, calling that wattsi parsing code from within your code

Not sure what you mean by that part… It sounds like something different than what you wrote earlier about just shelling out to your code.

Yeah, it's separate, don't worry about it. It's just that HTML parsing is a big chunk of Bikeshed's runtime that I can't reduce, and having it be faster would be nice. ^_^

I’ve added some handling to wattsi for serializing the contents of pre elements to JSON.

As far as the structure of the JSON it generates: As an example, given this HTML source:

  <pre class=idl>[Exposed=Window,
   <a href=#htmlconstructor id=the-p-element:htmlconstructor>HTMLConstructor</a>]
  interface <dfn id=htmlparagraphelement>HTMLParagraphElement</dfn> : <a href=#htmlelement id=the-p-element:htmlelement>HTMLElement</a> {};</pre>

…wattsi will serialize that to the following JSON:

  [
    {
      "class": "idl"
    },
    "[Exposed=Window,\n ",
    [
      "a",
      {
        "href": "#htmlconstructor",
        "id": "the-p-element:htmlconstructor"
      },
      "HTMLConstructor"
    ],
    "]\ninterface ",
    [
      "dfn",
      {
        "id": "htmlparagraphelement"
      },
      "HTMLParagraphElement"
    ],
    " : ",
    [
      "a",
      {
        "href": "#htmlelement",
        "id": "the-p-element:htmlelement"
      },
      "HTMLElement"
    ],
    " {};"
  ]

Each pre element is represented as an array with the following as items in the array:

  • For attributes, an object with the attribute names as keys.
  • For each text node, a string.
  • For each child element, an array with the element name as a string as the first item, followed by the attributes as an object as above, and each text node as a string, and any descendant elements as arrays (but not sure there are actually cases of element nesting in pre going that deep).
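As a quick illustration of how a consumer might read this structure, here is a minimal Python sketch that recovers the class and the plain text of one serialized pre. The function names are made up for the example, and it assumes the shape described above: the outer array starts with the attribute object, and child elements start with their name.

```python
def text_content(items):
    """Concatenate the text of a contents list, descending into child elements."""
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)  # text node
        elif isinstance(item, list):
            # Child element: [name, {attrs}, ...contents]
            parts.append(text_content(item[2:]))
    return "".join(parts)


def pre_text(pre):
    """Return (class, text) for a pre serialized as [{attrs}, ...contents]."""
    attrs, *contents = pre
    return attrs.get("class", ""), text_content(contents)


example = [
    {"class": "idl"},
    "[Exposed=Window,\n ",
    ["a", {"href": "#htmlconstructor", "id": "the-p-element:htmlconstructor"}, "HTMLConstructor"],
    "]\ninterface ",
    ["dfn", {"id": "htmlparagraphelement"}, "HTMLParagraphElement"],
    " : ",
    ["a", {"href": "#htmlelement", "id": "the-p-element:htmlelement"}, "HTMLElement"],
    " {};",
]
language, text = pre_text(example)
```

Here `language` is the class value ("idl") and `text` is the code with the markup stripped, which is roughly what a highlighter would need to tokenize before re-applying the links and definitions.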

For other examples, you can see https://gist.githubusercontent.com/sideshowbarker/8284404/raw/65c0d3e2aa4ef2b35b8246a9f7d4fc5bb6045cf2/html-spec-all-pre-elements.json, which has JSON output for all 1106 pre elements in the spec, in the same order as they appear in the spec.

The first element of the outermost array, before the attribute object, should be "pre", right?

The first element of the outermost array, before the attribute object, should be "pre", right?

Yes. But since it can always be assumed to be "pre", I intentionally had the wattsi serializer not emit it.

Do you want me to instead make the serializer explicitly include it each time?

I guess that’d make the parsing code for it easier to write.

I guess that’d make the parsing code for it easier to write.

Emitting the "pre"s also makes the wattsi code less complicated, so I went ahead and changed it.

https://gist.githubusercontent.com/sideshowbarker/8284404/raw/cff69f158ea995a17a73af3e9eff29823617caa8/html-spec-all-pre-elements.json has the updated output with the "pre"s added.
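With "pre" emitted explicitly, every node has the same [name, {attrs}, ...contents] shape, so one recursive function can handle both the outer array and its children. A minimal sketch of the "convert back to HTML" direction, assuming that uniform shape (the `to_html` name is hypothetical, and this is not the wattsi or Bikeshed code; void elements are not handled):

```python
from html import escape


def to_html(node):
    """Serialize a text string or a [name, {attrs}, ...contents] node to HTML."""
    if isinstance(node, str):
        return escape(node, quote=False)  # text node
    name, attrs, *contents = node
    attr_text = "".join(
        f' {key}="{escape(value, quote=True)}"' for key, value in attrs.items()
    )
    inner = "".join(to_html(child) for child in contents)
    return f"<{name}{attr_text}>{inner}</{name}>"


markup = to_html(["pre", {"class": "idl"}, "a < b ", ["dfn", {"id": "x"}, "Foo"]])
```

Without the leading "pre", the outer array would need its own special case, which matches the observation that emitting it makes the parsing code easier to write.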

This has been done for a while, thanks to heroic work from @sideshowbarker and @tabatkins :).