purescript/registry-dev

Determine a specific legacy date cutoff or require all packages to compile

thomashoneyman opened this issue · 7 comments

@f-f @MonoidMusician

When we publish a package today, we specify whether it is a Legacy or Current package. This source is quite old, from our first cut at the importer scripts, and its scope has diminished significantly over time. Nowadays it solely controls whether to fail the publishing pipeline if a package fails to compile. Legacy packages can fail compilation and current packages cannot.

Today we use Legacy when using the legacy importer and Current when using GitHub issues or the legacy importer in update-registry mode. This is a problem because if a package fails to compile and is rejected by GitHub issues / update-registry mode, and then we run the legacy importer locally, the package will be registered despite failing to compile.

I think that we should either a) change the Legacy vs. Current distinction to be based on package publication date, where legacy packages are before a certain cutoff date, or b) remove the Legacy vs. Current distinction altogether, requiring all packages in the registry to compile (including legacy packages).

My preference is to remove the distinction altogether, for the same reason we decided to remove packages from the registry if they don't solve: you should be able to install and build any package from the registry.

Option 1: Legacy Cutoff Date

If we choose to retain the legacy vs. current distinction, then I think we should choose a specific date cutoff after which a package is not considered 'legacy.' I would say that September 1, 2022 makes sense as a cutoff, since that's when we "launched" the registry and deprecated the old package sets. If the package was tagged after this date then it's considered a current package.
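As a rough illustration of the cutoff rule described above, here is a minimal Python sketch. The names `LEGACY_CUTOFF`, `Source`, and `classify` are hypothetical and are not taken from the registry codebase, and treating the cutoff as midnight UTC is an assumption.

```python
from datetime import datetime, timezone
from enum import Enum

# Cutoff proposed in the issue: September 1, 2022.
# Using midnight UTC is an assumption made for this sketch.
LEGACY_CUTOFF = datetime(2022, 9, 1, tzinfo=timezone.utc)


class Source(Enum):
    LEGACY = "legacy"
    CURRENT = "current"


def classify(published_at: datetime) -> Source:
    """Packages tagged after the cutoff are current; all others are legacy."""
    # published_at must be timezone-aware to be comparable with LEGACY_CUTOFF.
    return Source.CURRENT if published_at > LEGACY_CUTOFF else Source.LEGACY
```

With this rule the classification depends only on the tag date, not on which pipeline (importer or GitHub issue) the package came through.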

We know the package publication date because it's used in the package metadata; we determine this when fetching the package source from GitHub. We may not be able to determine the publication date for non-Git packages when we support more Locations in the future, but for those we can just assume time of registration as the publication date.

Upside: Easy to implement.
Downside: We still have special-casing for 'legacy' vs. 'current' packages in the pipeline. Legacy packages may be broken (they do not compile).

Option 2: All Packages Compile

Alternately, we can remove the legacy vs. current distinction altogether if we choose to only allow packages in the registry if they compile. We have already gone back through the registry to enforce that all packages solve; this would be upping the ante by also enforcing they compile.

This feels like the ideal solution, but it's made difficult because when we are using the legacy importer we must identify a specific compiler to use before hitting the publish pipeline. Identifying a compiler can be easy or difficult, depending on the circumstances.

I believe these heuristics will allow us to reliably choose a compiler to use to compile packages imported from Bower & Spago:

  • If the package uses Spago then we can identify the compiler version from the package set in use.
  • If the package uses Bower and has no dependencies, then we can use the package publication date to infer what compiler would have been active around that time; few packages exist that have no dependencies, so to make this more robust we could try the most recent compiler at that time, and then if that fails try a prior version, too.
  • If the package uses Bower and has dependencies, and we've already done #255, then we can take the intersection of compilers supported by its dependencies, and then choose perhaps the most recent compiler from that range.

We can then use the selected compiler version to run the publish pipeline as usual.
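The three heuristics above could be sketched roughly as follows. This is only an illustration in Python: `LegacyPackage`, `candidate_compilers`, and the `releases` mapping (compiler version to release date) are hypothetical names, and the per-dependency compiler sets assume the information from #255 is already available.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Set, Tuple

Version = Tuple[int, int, int]


@dataclass
class LegacyPackage:
    published: date
    # Compiler named by the package set, if the package uses Spago.
    spago_set_compiler: Optional[Version] = None
    # For Bower packages: the compilers each dependency is known to support.
    dependency_compilers: List[Set[Version]] = field(default_factory=list)


def candidate_compilers(
    pkg: LegacyPackage, releases: Dict[Version, date]
) -> List[Version]:
    """Return compiler versions to try, best guess first."""
    # Spago: the package set pins a compiler version.
    if pkg.spago_set_compiler is not None:
        return [pkg.spago_set_compiler]
    # Bower with dependencies: intersect what the dependencies support,
    # most recent compiler first.
    if pkg.dependency_compilers:
        shared = set.intersection(*pkg.dependency_compilers)
        return sorted(shared, reverse=True)
    # Bower without dependencies: the most recent compiler released on or
    # before the publication date, then the one before it as a fallback.
    active = sorted(
        (v for v, released in releases.items() if released <= pkg.published),
        reverse=True,
    )
    return active[:2]
```

An empty result means no compiler could be guessed, which is the case where a wrong heuristic would risk rejecting a package that actually builds.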

Upside: Every package in the registry is known to solve & compile. There is no distinction between a "legacy" and "non-legacy" package, and there is no special-cased code.
Downside: Our heuristics may be wrong and we incorrectly delete packages from the registry. More complicated to implement.

I vote for Option 1. First of all, I really don't want to spend time on that, and I don't really want you to spend time on that either. I don't think it would get done in, say, a month and it's more important to get something out the door. Also I think that we're not going to care about Legacy packages in a few months, especially after the next major compiler release whenever that happens. And I'm a bit antsy about guessing compiler versions, especially if that means we'd be removing stuff based on that information.

f-f commented

A few thoughts on this - the distinction of "legacy package" vs "new package" is not really about any of the above - which I consider implementation details, as it's literally about which code pipeline a package is going through - but really it is about "are we grandfathering this package in?".
This ultimately boils down to "does this package have a purs.json file?".
In this light, all packages are legacy packages right now!

The above definition is important, because it denotes the scope of our actions as the Packaging Team. We are allowed to import "legacy" packages ourselves, as package authors can't retroactively change already published packages. Once an author adds a purs.json to their package then we should stop having any responsibility on it, and we should not auto-import it. This also denotes the line between "packages that we can republish" and "packages for which the registry is the only source of truth".

Going back to the original issue, it seems that the problem is the fact that

Today we use [..] Current when using GitHub issues or the legacy importer in update-registry mode

How does this make sense? If a package is going through the legacy importer then it cannot be Current at all.

I am ok with requiring all the legacy packages to build, but if we go to the effort of doing that then we should do a proper job. Having a "cutoff date" is arbitrary, and does not fix the problem either, so Option 2 is really the only way that makes sense to me if we want to go forward with this.
If we want to have a cutoff date of sorts, then something that makes more sense to me is a general cutoff on the version of purs, i.e. if a package only compiles with something lower than purs-0.13.0 then it's out. The ecosystem has moved a lot anyway, so the vast majority of people who care about the registry will not be bothered by this.

Something that helps identify the compiler version is to infer it from the version of prelude, since they usually go hand in hand.

A few thoughts on this - the distinction of "legacy package" vs "new package" is not really about any of the above - which I consider implementation details, as it's literally about which code pipeline a package is going through - but really it is about "are we grandfathering this package in?". This ultimately boils down to "does this package have a purs.json file?". In this light, all packages are legacy packages right now!

Yes, in registry terminology, a legacy package is one that does not have a purs.json and a regular/normal/new package is one which does. However, in light of #435, I suggest that a new package is one with a supported manifest format, i.e. purs.json or spago.yaml. I expect many packages in the future will not have a purs.json file because they have a spago.yaml file. Put differently: do we have to use Legacy.Manifest to get a manifest for this package?

Going back to the original issue, it seems that the problem is the fact that

Today we use [..] Current when using GitHub issues or the legacy importer in update-registry mode

How does this make sense? If a package is going through the legacy importer then it cannot be Current at all.

Legacy vs. Current isn't a distinction between "legacy" packages and "current" packages, as you noted; it's really a distinction about what level of restrictions we impose on the legacy package. A very old package like prelude@1.0.0 is allowed to fail compilation, and a relatively new package like prelude@6.0.0 is not.

That's because we would very much like for all packages in the registry to compile, but when the process is not via a GitHub issue we can only guess at the correct compiler version, and we have no heuristics (yet) that let us reliably make that guess.

When we go through GitHub issues it's still a legacy package, as you noted, it's just that we know what compiler version to use with the package. The legacy importer is in an odd state; it assumes that the update-registry mode is intended to be used in the daily GitHub CI, and so it assumes that packages published today are almost certainly going to be usable with the 0.15.x series of the PureScript compiler. So it uses Current. This has been the case since it was first launched.

This takes us back to the Legacy vs. Current distinction, which really ought to be either data Compile = Required | Optional, with some rule about when we require compilation to succeed, or be removed altogether if we can always confidently guess the compiler version to use.

If, as @MonoidMusician suggested, we don't want to put in the work to get the whole registry to compile, then choosing a compiler version cutoff like 0.13.0 where we require it only after that version makes sense to me — but we still have to be able to reliably guess what version the package uses.
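To make the Required/Optional idea concrete, here is a small Python sketch of one possible rule that uses the 0.13.0 cutoff. The function name `compile_policy` is hypothetical, and treating an unknown compiler version as Optional is an assumption of this sketch, not something decided in the thread.

```python
from enum import Enum
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Compiler cutoff discussed in the thread.
CUTOFF: Version = (0, 13, 0)


class Compile(Enum):
    REQUIRED = "required"
    OPTIONAL = "optional"


def compile_policy(guessed_compiler: Optional[Version]) -> Compile:
    """Require compilation only when the guessed compiler is 0.13.0 or later."""
    # Assumption: if no compiler could be guessed, compilation stays optional.
    if guessed_compiler is not None and guessed_compiler >= CUTOFF:
        return Compile.REQUIRED
    return Compile.OPTIONAL
```

The rule is only as reliable as the guess it receives, which is why the guessing step still matters even with a cutoff.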

In the registry call @f-f reminded me that the https://github.com/purescript/pursuit-backups repository includes the compiler version used to publish to Pursuit, so that may be a reliable way to determine a valid compiler version. See, for example, the very end of this JSON:

https://github.com/purescript/pursuit-backups/blob/master/purescript-abc-parser/1.8.0.json
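A minimal Python sketch of reading the compiler version out of such a backup file might look like the following. The key name `compilerVersion` is an assumption about the backup format (the thread only says the version appears at the end of the JSON), so it is left as a parameter, and the helper name is hypothetical.

```python
import json
from typing import Optional, Tuple


def compiler_version_from_backup(
    raw_json: str, key: str = "compilerVersion"
) -> Optional[Tuple[int, ...]]:
    """Parse a dotted compiler version from a Pursuit backup document.

    The default key name is an assumption, not confirmed by the thread.
    Returns None when the key is missing or the value is not dotted integers.
    """
    doc = json.loads(raw_json)
    if not isinstance(doc, dict):
        return None
    value = doc.get(key)
    if not isinstance(value, str):
        return None
    try:
        return tuple(int(part) for part in value.split("."))
    except ValueError:
        return None
```

Returning `None` instead of raising lets a caller fall back to another heuristic when a package has no usable backup entry.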

We've had a discussion in core and I believe we have a consensus to apply a 0.13.0 cutoff.

We have a consensus to:

  1. Drop package versions prior to PureScript 0.13.0
  2. Drop packages altogether if they have no versions from 0.13.0 onward, freeing their names.
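The two rules above can be sketched in Python as a single pruning pass. This is illustrative only: `prune_registry` and the shape of the `registry` mapping (package name to package version to compiler version) are hypothetical, and it assumes a compiler version is already known for every package version.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]

CUTOFF: Version = (0, 13, 0)


def prune_registry(
    registry: Dict[str, Dict[str, Version]]
) -> Dict[str, Dict[str, Version]]:
    """Apply the 0.13.0 cutoff to {package name: {version: compiler}}."""
    kept: Dict[str, Dict[str, Version]] = {}
    for name, versions in registry.items():
        # Rule 1: drop package versions from before the cutoff compiler.
        surviving = {v: c for v, c in versions.items() if c >= CUTOFF}
        # Rule 2: drop the package (freeing its name) if nothing survives.
        if surviving:
            kept[name] = surviving
    return kept
```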

Starting up on this one!

Edit: This is a bit tricky to implement after all, as it relates to #255, so I'm back to being paused for the time being.