w3c/webref

Publish list of known fragment identifiers

foolip opened this issue · 6 comments

This would be helpful for web-platform-dx/web-features#84, to be able to create a spec URL validator that checks if a URL like https://w3c.github.io/webrtc-pc/#dom-datachannel-binarytype is a good spec URL.

A similar problem is solved in Bikeshed by downloading the data directly from GitHub:
https://github.com/speced/bikeshed/blob/584813e6380533a19c6656594c810bf974854e68/bikeshed/update/updateCrossRefs.py#L236

For something that should go into a CI check, though, that's not a good approach, since the build could break at any time.

> This would be helpful for web-platform-dx/web-features#84, to be able to create a spec URL validator that checks if a URL like https://w3c.github.io/webrtc-pc/#dom-datachannel-binarytype is a good spec URL.

The list of fragment identifiers appears in the ids extracts. For example, the URL you suggest appears in the WebRTC ids extract.
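
A validator could fetch that extract and test membership directly. A minimal sketch, assuming the ids extracts are JSON files with an `ids` array of absolute URLs (fragment included) and that the WebRTC extract lives at `ed/ids/webrtc.json` (both worth double-checking against the actual repository layout):

```js
// Minimal sketch: check a spec URL against the WebRTC ids extract.
// Assumes Node 18+ (global fetch) and the extract layout described above.
const EXTRACT =
  'https://raw.githubusercontent.com/w3c/webref/main/ed/ids/webrtc.json';

async function isKnownFragment(url) {
  const { ids } = await (await fetch(EXTRACT)).json();
  return ids.includes(url);
}

isKnownFragment('https://w3c.github.io/webrtc-pc/#dom-datachannel-binarytype')
  .then(known => console.log(known ? 'known fragment' : 'unknown fragment'));
```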

> For something that should go into a CI check, though, that's not a good approach, since the build could break at any time.

We could create an NPM package, but I'm wondering how that would solve the "could break at any time" problem. Could you clarify?

If we go ahead with a package, I wonder about the frequency of releases and about guarantees. We don't do any data curation on fragment identifiers (and if we could avoid doing additional curation, I think we wouldn't mind ;)). We could automate the publication of the package, but the list of fragment identifiers changes frequently. Should we publish a package one or more times per day? Or should we restrict publications to, say, once per week?

These are good questions. The important part for avoiding sudden CI breakage is that the IDs are pinned in some way. An NPM package makes that easy and lets Dependabot propose updates. But it can also be done by pointing to a specific webref commit, perhaps using it as a submodule.
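
For illustration, assuming a hypothetical @webref/ids package (no such package exists today), pinning would just be an exact version in package.json, and Dependabot would then open PRs to bump it:

```json
{
  "devDependencies": {
    "@webref/ids": "1.0.0"
  }
}
```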

The release cadence is a good question. I guess roughly weekly would be OK. And I agree that it would be fantastic to not have to review changes to identifiers at all or make many guarantees, just expose the same stuff that Bikeshed uses.

This isn't urgent at all BTW, it's a nice-to-have.

It suddenly occurs to me that looking at the full list of fragment identifiers is probably not a good idea in any case: the "pinning" mechanism you describe provides the same sort of stability that specs need when they reference one another. That need is what led to exported definitions. Ideally, features would only link to exported definitions... and likely section headings. In any case, links to internal definitions and other IDs should be discouraged.

The data is already in Webref too, in the dfns and headings extracts.

We have tools in place that detect broken links from Webref data (w3c/strudy) and report them automatically. We could also detect changes earlier on in Webref. In the end, we could perhaps create a package that contains stable fragment identifiers (exported definitions and section ids), and use some semver logic to report breaking changes (a rough sketch follows the list):

  • patch increment: new fragment identifiers added
  • minor increment: some fragment identifiers disappeared
  • major increment: major data structure change
    (or major increment for any fragment identifier change)
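
Something like this rough sketch for the comparison step (names are hypothetical):

```js
// Hypothetical helper: given the previous and current sets of stable
// fragment identifier URLs, pick the semver bump per the scheme above.
// Major bumps would be reserved for data structure changes.
function pickBump(previousIds, currentIds) {
  const removed = [...previousIds].some(id => !currentIds.has(id));
  const added = [...currentIds].some(id => !previousIds.has(id));
  if (removed) return 'minor'; // some fragment identifiers disappeared
  if (added) return 'patch';   // new fragment identifiers added
  return null;                 // nothing changed, no release needed
}
```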

Good point about not all IDs being good feature links; I didn't even consider other linkable things like examples and whatnot.

I strongly suspect that doing this will reveal lots of things that aren't exported but should be, and that it will be a bit of a slog.

But I like the approach!

I was just looking if something like this exists! :)

My use case: In mdn/browser-compat-data, we'd like to remove status.standards_track (mdn/browser-compat-data#1531) and only refer to spec_urls. I think in order for this to happen it would be good if there was a BCD linter that checked if all of our spec_urls are actually valid, including their fragment id.
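
For concreteness, a lint pass along these lines could work, assuming a knownUrls Set built from Webref data (BCD stores spec_url as a string or an array of strings under __compat):

```js
// Sketch of a BCD lint pass: walk the compat data tree, collect every
// spec_url, and flag the ones that aren't in the known set.
function collectSpecUrls(node, out = []) {
  if (node && typeof node === 'object') {
    if (node.__compat?.spec_url) {
      out.push(...[].concat(node.__compat.spec_url));
    }
    for (const child of Object.values(node)) collectSpecUrls(child, out);
  }
  return out;
}

function lintSpecUrls(bcd, knownUrls) {
  return collectSpecUrls(bcd).filter(url => !knownUrls.has(url));
}
```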

> My use case: In mdn/browser-compat-data, we'd like to remove status.standards_track (mdn/browser-compat-data#1531) and only refer to spec_urls. I think in order for this to happen it would be good if there was a BCD linter that checked if all of our spec_urls are actually valid, including their fragment id.

Are you looking for an actual NPM package? Or are you more looking for a way to validate URLs with fragments, which could live in BCD?

I'm asking the question because, per the discussion above, the data that is needed is already in Webref. You may validate URLs with fragments in one of two ways:

  1. If you're not too worried about the stability of the IDs and just want to know whether an ID exists, you could look at the ids extracts.
  2. If you'd like to enforce some sort of stability, you might want to restrict yourself to terms that specs actually export. For that, you could look at the dfns extracts (possibly filtering on definitions that have an "access": "public" property) and at the headings extracts for links to sections. The dfns and ids extracts contain additional information about the fragment, which might perhaps prove useful later on in MDN as well, e.g., to label the links? A sketch of this second option follows.
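
For instance (property names like dfns, access, href, and headings match the extracts as described above, but are worth double-checking):

```js
// Sketch: build the set of "stable" fragment URLs for a spec from its
// dfns extract (exported definitions only) and its headings extract.
const BASE = 'https://raw.githubusercontent.com/w3c/webref/main/ed';

async function stableUrls(shortname) {
  const stable = new Set();
  const { dfns } = await (await fetch(`${BASE}/dfns/${shortname}.json`)).json();
  for (const dfn of dfns) {
    if (dfn.access === 'public') stable.add(dfn.href);
  }
  const { headings } = await (await fetch(`${BASE}/headings/${shortname}.json`)).json();
  for (const heading of headings) stable.add(heading.href);
  return stable;
}
```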

An NPM package would provide some pinning ability, but a side effect of that pinning is that the data will often be somewhat outdated: the dfns data gets updated every 6 hours, and it does not make a lot of sense to publish an NPM package that frequently. The other NPM packages for Webref also contain somewhat outdated data, of course, but their content is the result of data curation and manual review, performed once in a while.

For the problem at hand, there's no obvious commit to pin the data to. Perhaps what we need is an NPM package that only contains a validateUrl function: it would retrieve the latest data from Webref by default, and could take a commit ID as a parameter to retrieve the Webref data at that particular commit, for cases where you need the function to return a stable result?
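
Something along these lines, perhaps (the shortname parameter and file layout are simplifications; a real implementation would first resolve which spec a URL belongs to):

```js
// Hypothetical validateUrl: check a URL against the ids extract,
// fetched from the latest Webref data by default, or from a specific
// commit when a stable result is needed.
async function validateUrl(url, { shortname, commit = 'main' } = {}) {
  const extract =
    `https://raw.githubusercontent.com/w3c/webref/${commit}/ed/ids/${shortname}.json`;
  const { ids } = await (await fetch(extract)).json();
  return ids.includes(url);
}

// Latest data:
//   await validateUrl(url, { shortname: 'webrtc' });
// Pinned for reproducible CI runs:
//   await validateUrl(url, { shortname: 'webrtc', commit: '<sha>' });
```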