whatwg/html-build

Add MDN annotations

Closed this issue · 5 comments

For each feature/section of the spec that has a corresponding MDN article, we should have the build add annotations in the spec output, linking from the spec section to the corresponding MDN article.

The MDN annotations would serve a purpose similar to the caniuse.com annotations the spec already contains; for example, at https://html.spec.whatwg.org/multipage/history.html#dom-history-pushstate


[Image: caniuse.com annotation]


…but the MDN annotations could be lighter-weight, e.g., something like https://es5.github.io/#x15.1.3.4


[Image: ES5 spec annotation]


…where, in that "15.1.3.4 encodeURIComponent (uriComponent) # Ⓣ Ⓡ" heading, the Ⓡ is a link to https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/encodeURIComponent

Getting the data

MDN articles on features in the HTML spec generally contain a Specifications section which has a link back to the corresponding HTML spec section.

For example, the MDN article on the History interface has this Specifications section:

https://developer.mozilla.org/en-US/docs/Web/API/History#Specifications

…which contains the following table:

| Specification | Status | Comment |
| --- | --- | --- |
| HTML Living Standard (The definition of 'History' in that specification.) | Living Standard | Adds the scrollRestoration attribute. |
| HTML5 (The definition of 'History' in that specification.) | Recommendation | Initial definition. |
| Custom Scroll Restoration - History-based API (The definition of 'History' in that specification.) | Editor's Draft | Adds the scrollRestoration attribute. |

…where https://html.spec.whatwg.org/multipage/browsers.html#the-history-interface is the URL for the “HTML Living Standard” hypertext.

So for the HTML spec build we’d need some kind of data file (similar to https://raw.githubusercontent.com/Fyrd/caniuse/master/data.json) that wattsi could check each spec-section ID against, to see whether there’s a corresponding MDN article whose Specifications section contains a link back to that HTML spec section ID.
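To make the idea concrete, here is a minimal Python sketch of what such a data file and lookup might look like. The JSON shape (spec fragment ID mapped to a list of MDN articles) and the names `SAMPLE_DATA`, `mdn_articles_for`, and `annotate` are assumptions for illustration only; wattsi itself would need its own implementation.

```python
import json

# Hypothetical data-file shape: spec fragment ID -> list of MDN articles.
SAMPLE_DATA = json.loads("""
{
  "the-history-interface": [
    {"title": "History",
     "url": "https://developer.mozilla.org/en-US/docs/Web/API/History"}
  ]
}
""")


def mdn_articles_for(section_id, data):
    """Return the MDN articles recorded for a spec-section ID (possibly none)."""
    return data.get(section_id, [])


def annotate(section_ids, data):
    """Map each section ID that has at least one MDN article to its URLs."""
    annotations = {}
    for section_id in section_ids:
        articles = mdn_articles_for(section_id, data)
        if articles:
            annotations[section_id] = [article["url"] for article in articles]
    return annotations
```

Section IDs with no matching article are simply left out of the result, so the build would emit no annotation for them.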

But as far as I know, MDN currently doesn’t provide any kind of API that would allow us to get a data file with the metadata we’d need.

We can get some metadata for any individual MDN article by appending $json to the article URL:

https://developer.mozilla.org/en-US/docs/Web/API/History$json

… but even that individual-page metadata doesn’t (yet) contain the Specifications metadata we need.
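As a rough illustration of the `$json` mechanism described above, the following Python sketch builds the metadata URL for an article and fetches it. The helper names are hypothetical, and the sketch assumes the endpoint returns a JSON document; which fields it contains is not assumed here.

```python
import json
from urllib.request import urlopen


def mdn_json_url(article_url):
    """Build the per-article metadata URL by appending "$json"."""
    # Drop any fragment (e.g. "#Specifications") before appending.
    base = article_url.split("#", 1)[0]
    return base + "$json"


def fetch_article_metadata(article_url, timeout=10):
    """Fetch and decode the metadata for one article (needs network access)."""
    with urlopen(mdn_json_url(article_url), timeout=timeout) as response:
        return json.load(response)
```

Fetching one article at a time this way would still mean one request per article, which is part of why a single data file would be preferable.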

The only other relevant mechanism I’m aware of that MDN does provide is a tarball of the entire contents of MDN (https://developer.mozilla.org/en-US/docs/MDN/About#Downloading_content). That’s a 2GB+ download that, when untarred and unzipped, isn’t a single data file but a couple of hundred thousand individual HTML files totaling 14GB.

So to get what we need for the purposes of the HTML spec build, some kind of API would first need to be added to MDN to expose the Specifications metadata in a format we could work with practically.

@sideshowbarker I've also silently wanted MDN to expose a list of specs somehow. Can you file an issue on https://github.com/mdn/mdn for what we need here? (https://github.com/mdn/browser-compat-data is something similar, but not about specs, but maybe it could be.)

> @sideshowbarker I've also silently wanted MDN to expose a list of specs somehow.

You know about https://github.com/mdn/kumascript/blob/master/macros/SpecData.json? That’s a complete list of specs that are recognized by the MDN/kumascript SpecName and Spec2 macros.

> Can you file an issue on https://github.com/mdn/mdn for what we need here?

Yes, but in the meantime I’m also working on an html-build patch to generate the data (by crawling MDN with wget to get the relevant pages, and then building the data from those).
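The data-building half of that approach could look roughly like the Python sketch below. It is not the actual html-build patch: it assumes the crawled pages are already available as HTML strings, and for simplicity it collects every link into html.spec.whatwg.org on a page rather than only those inside the Specifications section. The names `SpecLinkCollector` and `build_index` are made up for this example.

```python
from html.parser import HTMLParser

SPEC_PREFIX = "https://html.spec.whatwg.org/"


class SpecLinkCollector(HTMLParser):
    """Collect hrefs of anchors that point at a fragment in the HTML spec."""

    def __init__(self):
        super().__init__()
        self.spec_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(SPEC_PREFIX) and "#" in href:
            self.spec_links.append(href)


def build_index(pages):
    """pages: {mdn_url: html_text}. Returns {spec fragment ID: [mdn_url, ...]}."""
    index = {}
    for mdn_url, html_text in pages.items():
        collector = SpecLinkCollector()
        collector.feed(html_text)
        for href in collector.spec_links:
            fragment = href.split("#", 1)[1]
            urls = index.setdefault(fragment, [])
            if mdn_url not in urls:
                urls.append(mdn_url)
    return index
```

Keying the index by fragment ID matches how the build would look things up: one check per spec-section ID.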

https://github.com/mdn/browser-compat-data contains compat data for web features along with an mdn_url. In mdn/browser-compat-data#1531 we are also thinking about maybe adding another property called spec_urls to this data set to allow mapping between mdn <-> compat data <-> specifications. We haven't started this work, though. Additional ideas or comments on why this would be a good idea would be appreciated in that issue.

> I've also silently wanted MDN to expose a list of specs somehow. Can you file an issue on https://github.com/mdn/mdn for what we need here? (https://github.com/mdn/browser-compat-data is something similar, but not about specs, but maybe it could be.)

> https://github.com/mdn/browser-compat-data contains compat data for web features along with an mdn_url. In mdn/browser-compat-data#1531 we are also thinking about maybe adding another property called spec_urls to this data set to allow mapping between mdn <-> compat data <-> specifications. We haven't started this work, though. Additional ideas or comments on why this would be a good idea would be appreciated in that issue.

So yeah the addition of that spec_urls property to BCD would be very useful for us here in generating the MDN annotations. Specifically, it’d significantly reduce the time and bandwidth needed for the part of patch #184 that builds the data we need by scraping out content from MDN itself.
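For illustration, if BCD did carry such a property, inverting it into a spec-URL-to-MDN-URL map might look like this Python sketch. The `spec_urls` key is the proposed property from the discussion above, not something BCD is known to provide here, and `SAMPLE_BCD` is invented sample data in the nested `__compat` layout BCD uses.

```python
# Invented sample in BCD's nested layout; "spec_urls" is the *proposed* property.
SAMPLE_BCD = {
    "api": {
        "History": {
            "__compat": {
                "mdn_url": "https://developer.mozilla.org/docs/Web/API/History",
                "spec_urls": [
                    "https://html.spec.whatwg.org/multipage/browsers.html#the-history-interface"
                ],
            },
            "pushState": {
                "__compat": {
                    "mdn_url": "https://developer.mozilla.org/docs/Web/API/History/pushState",
                    "spec_urls": [
                        "https://html.spec.whatwg.org/multipage/history.html#dom-history-pushstate"
                    ],
                }
            },
        }
    }
}


def spec_index_from_bcd(tree, index=None):
    """Walk a BCD-like nested dict and map each spec URL to its MDN URLs."""
    if index is None:
        index = {}
    for key, value in tree.items():
        if not isinstance(value, dict):
            continue
        if key == "__compat":
            mdn_url = value.get("mdn_url")
            if mdn_url:
                for spec_url in value.get("spec_urls", []):
                    index.setdefault(spec_url, []).append(mdn_url)
        else:
            # Any other dict is a nested feature; recurse into it.
            spec_index_from_bcd(value, index)
    return index
```

Features without an `mdn_url` or without `spec_urls` contribute nothing, so the sketch tolerates a data set in which the property is only partially filled in.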

However, even with the addition of the spec_urls property, we’d still need to scrape data out of MDN. That’s because the #184 patch also makes use of the MDN article (SEO) summaries and the MDN article titles.

But I’ve noticed that the mdncomp tool https://github.com/epistemex/mdncomp makes use of a sort of enhanced version of the BCD data, maintained at https://github.com/epistemex/mdncomp-data, which includes not just the spec URLs but also summary descriptions.

I assume @epistemex is generating/compiling that enhanced BCD data by regularly (weekly I think) running some kind of scraping of MDN to get it and build it into the data.

So @Elchi3, given the above, and given that per https://github.com/mdn/browser-compat-data#browser-compatibility-tables-on-mdn there’s already a (Kumascript) build step that’s run regularly (every 4-14 days) to deploy the BCD data, I wonder whether MDN itself could eventually expose a similar compiled, enhanced form of the BCD data that adds the MDN summary descriptions and article titles (pulled from MDN itself through a build step).

Anyway, this whatwg/html-build issue isn’t the right place to discuss that idea further, so I guess I’ll take discussion of that to https://discourse.mozilla.org/t/proposal-have-mdn-api-provide-data-for-all-specs-linked-to-in-specifications-tables/31768 or https://bugzilla.mozilla.org/show_bug.cgi?id=1491194

But in the meantime, the https://github.com/epistemex/mdncomp-data data provides it (mostly — it’d also need to provide the article titles https://gitlab.com/epistemex/mdncomp/issues/2), so I think I’ll be able to change the #184 patch to use that for now.