
Cut down on build dependencies (data files and binaries)

Closed this issue · 12 comments

A few things are currently less than ideal:

  • A lot of data needs to be downloaded, and it's all thrown away when "Build tools have been updated since last run; clearing the cache."
  • .entity-processor.py says "this uses 658 MB", and in fact I cannot run it on my VPS.
  • Users need to install Subversion and Perl's XML::Parser, which are likely not pre-installed.
  • (minor) We don't track dependencies, so builds are not reproducible.

Wouldn't it be nice if building were just blazing fast by default, and rebuilding dependencies was an option that should rarely be used?

It looks like the files that are eventually used are only these 6:

  • caniuse.json
  • cldr.inc
  • entities-dtd.url
  • entities.inc
  • entities.json
  • w3cbugs.csv

Together they are only 1.7 MB, or 282 kB gzipped. That's a lot of room for saving.
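For anyone who wants to reproduce those numbers, here is a minimal sketch in Python. It is not part of the build. The directory name in the usage comment is a placeholder, and the gzipped total depends on the compression level and on whether the files are compressed individually (as here) or as one archive.

```python
import gzip
from pathlib import Path


def total_sizes(paths):
    """Return (raw_bytes, gzipped_bytes) summed over the given files.

    Each file is gzipped on its own, so the result is an upper bound
    on what a single compressed archive of all files would take.
    """
    raw = 0
    packed = 0
    for path in paths:
        data = Path(path).read_bytes()
        raw += len(data)
        packed += len(gzip.compress(data))
    return raw, packed


# Usage (the ".cache" directory is an assumption, adjust to your checkout):
#   names = ["caniuse.json", "cldr.inc", "entities-dtd.url",
#            "entities.inc", "entities.json", "w3cbugs.csv"]
#   print(total_sizes(Path(".cache") / n for n in names))
```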

Rough proposal:

  • Separate out the scripts for building these dependencies so that they can easily be built without also building the spec.
  • Set up an automatically updated html-build-deps repo that has the output.
  • In build.sh, by default use the html-build-deps repo, but have an option to generate from scratch.
  • (maybe) Track the exact html-build-deps commit to use, using either submodules or a DEPS file.
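
To make the last two bullets a bit more concrete, here is a rough sketch of the decision logic, written in Python rather than in build.sh itself. The DEPS file format (one "name commit" pair per line), the `regenerate` switch, and the function names are all hypothetical; nothing like this exists in the repo yet.

```python
def parse_deps(text):
    """Parse a hypothetical DEPS file: one 'name commit' pair per line.

    Blank lines and anything after a '#' are ignored.
    """
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, commit = line.split()
        pins[name] = commit
    return pins


def plan_build(pins, regenerate=False):
    """Return the steps the build would take for its dependencies.

    By default the prebuilt html-build-deps output is used, at the pinned
    commit if there is one and at master otherwise. Regenerating from
    scratch is opt-in.
    """
    if regenerate:
        return ["regenerate dependencies from upstream sources"]
    commit = pins.get("html-build-deps", "master")
    return [f"fetch html-build-deps at {commit}"]
```

With a pin the default build is reproducible; without one it simply follows master, which trades reproducibility for never having to roll the pin.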

Related issues:
#60 (would be made obsolete)

I agree with the motivation behind this plan and many of the particulars. Here are some nits:

  • I am not sure that caniuse.json and w3cbugs.csv are in the same category as the others. But, I guess seeing explicit changes to them over time does have value, so maybe they are.
  • I think a third repo is probably not necessary. I'd be fine with having them either here or in whatwg/html. Probably in both cases they'd be under a subdirectory ("support"?) with a README.md inside explaining how these files are generated and updated, and that they should not be edited manually. This avoids any submodules work.
  • This is optional, but I think the ideal workflow would involve a GitHub helper bot that automatically sends pull requests rolling the dependencies when they change. This seems kind of subtle though, e.g. who knows how often they would change (would every other commit become a "rolling deps" commit?) and what would the bot do if we haven't accepted its PRs by the time it needs to post a new one. Maybe too much work.

Upon reflecting on the last two points combined: I think the key question is how often these things change. If they change very frequently, they should probably be in a separate repo, and the build script should always use master; that removes any "rolling deps" commit noise. If they change infrequently (say at most 1/week, preferably less) then keeping them in whatwg/html seems reasonable. I guess keeping them in whatwg/html-build might work even with frequent deps...

Hmm. Maybe your original plan of just a separate repo that gets automatically updated with no human intervention is simplest :).

Right, it matters how often these things change. Some details on each dependency:

The biggest win would clearly come from pre-building the outputs of the cldr and entities dependencies. Those would rarely change, so if we stop there we could just keep the output in the html or html-build repo.

I don't think we should bother with caniuse.json and w3cbugs.csv. Both are clearly useful. Use -n to avoid a download.

I think we should stop doing unicode.xml by default. If unicode.xml changes, an issue needs to be filed and we need to get agreement on adding a new entity to HTML and XML. That should not happen automatically without anyone noticing.
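
One way to still notice upstream changes without acting on them automatically would be to record a digest of the unicode.xml we last reviewed and have a separate check complain when the upstream copy differs. A minimal sketch, assuming we store a SHA-256 hex digest somewhere in the repo (the function name and the storage location are made up for illustration):

```python
import hashlib


def has_changed(data: bytes, recorded_sha256: str) -> bool:
    """Return True if `data` no longer matches the recorded SHA-256 digest.

    `recorded_sha256` is the hex digest of the last reviewed copy; where it
    is stored is left open here.
    """
    return hashlib.sha256(data).hexdigest() != recorded_sha256.lower()
```

A change would then result in someone filing an issue, not in a silently regenerated entities list.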

cldr.inc is used in the "Quotes" subsection of "Rendering", and nowhere else. I don't think that's actually implemented currently, so it might be a candidate for removal, though I think implementers might want to get to it eventually for proper quotation rendering? Also, that section says:

User agents are expected to use either the block below (which will be regularly updated) or to automatically generate their own copy directly from the source material. The language codes are derived from the CLDR file names. The quotes are derived from the delimiter blocks, with fallback handled as specified in the CLDR documentation.

so we could just provide that algorithm instead and drop this dependency. User agents want that algorithm too since otherwise they would have to scrape this or write such an algorithm based on their own CLDR copy.

Test for quotes: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3865

Output per spec should be:
English: “In Gothenburg people say ‘gôtt’ all day long.”
Svenska: ”I Göteborg säger folk ’gôtt’ mest hela dagen.”

No browser matches that. Chrome, Safari and Edge get it all right except for the Cantonese (粵文) where Chrome and Safari fall back to ASCII quotes and Edge uses English-style quotes. Firefox uses English-style quotes everywhere.

Looks like enough browsers are trying to implement this that we shouldn't just drop it, but having a moving target like this is annoying.

So I wonder if we could define the quotes stuff in a way that matches what https://tc39.github.io/ecma402/ does. Recommend that implementations use the locale data from that repository, but allow them to use their own mapping. And simply define in prose what that would look like.
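
As a very rough idea of what such prose could boil down to, here is a sketch in Python. The two table entries are taken from the expected output quoted earlier in this thread. The fallback, which strips subtags from the right until a match is found, is my simplification and not necessarily what the CLDR documentation specifies, and the function name is made up.

```python
# (open, close, nested open, nested close) per language.
# Only the two languages from the expected output above are listed.
QUOTES = {
    "en": ("\u201c", "\u201d", "\u2018", "\u2019"),
    "sv": ("\u201d", "\u201d", "\u2019", "\u2019"),
}


def quotes_for(lang, table=QUOTES):
    """Look up quote marks for a language tag such as 'sv-SE'.

    Falls back by dropping the last subtag ('sv-SE' -> 'sv'). Returns None
    when nothing matches, leaving the default to the implementation.
    """
    tag = lang.lower()
    while tag:
        if tag in table:
            return table[tag]
        tag = tag.rpartition("-")[0]
    return None
```

Implementations would be free to back `table` with their own CLDR copy; the spec would only need to describe the lookup.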

With #63 looking good, CLDR is the biggest remaining offender. I'll take a look, using a similar approach.

I would still prefer fixing that by defining an algorithm in the HTML standard and dropping all the build code in the process. I'd be happy to give that a shot. Or does Chromium actually implement this through copy-and-pasting the stylesheet?

Okay, I'm on board with the quotes/ and entities/ approach. That way we also know when either changes. I guess now we're waiting for @domenic to wake up.

I just wanted to say after looking over the PRs that I love the readmes in the subdirectories. Very nice.

I'd consider this fixed now, with #63 and #66 merged. The caniuse dependency isn't one we want to get rid of, and burning down the list of Bugzilla bugs until we can remove that dependency isn't really worth tracking with this bug; it's rather something we need to do in html.

Filed whatwg/html#619 to burn down the list of bugs that come in via w3cbugs.csv.