emgarten/NuGet.CatalogReader

[Suggestion] Filtering packages by profile (uploader)

Closed this issue · 4 comments

For example, if someone wanted to resolve a dependency-hell issue across all versions of .NET Core (1.0, 1.1, 2.0) and all the included templates (mvc, console, xunit, etc.), they might look for a way to download all versions of all packages uploaded by specific official organizations, such as Microsoft, aspnet, EntityFramework, and dotnetframework, instead of mirroring the whole 990k-package NuGet repository (which is also full of junk packages, with no cleanup in sight).

I would try to dive into the code in a fork and implement such a feature, but I do not have the time currently, so for now I just want to open a discussion on the matter.

I like the idea, but I'm not sure what the best way to get that information is.

To find out the actual owner that uploaded the package, the search service would need to be used.

Authors, which are not verified since the value is just a field in the nuspec, are available through registration blobs. The catalog reader could find all IDs, then read all registration pages to find the authors.

It would be helpful if there was a way to mirror only the good packages.

@joelverhagen do you have any thoughts on this?

"It would be helpful if there was a way to mirror only the good packages." That would be great, and that is what I'd very much like to do (mirroring the entire NuGet repo has become infeasible due to the size and number of packages). As a milestone, though, getting all packages by author, much like getting all versions of a single package, is a great stepping stone.

This is a great idea.

As you guys have mentioned, there are some challenges around programmatically discovering a package's owner (or owners, since a package ID can be owned by multiple usernames). On the bright side, ownership is at the package ID level. This means that the ownership "map" does not need to take package version into account and is therefore much smaller than the list of all packages.

Another challenge is that ownership information does not flow through the catalog at all. It is a bit of data that is only merged into the corpus of packages in the NuGet.org search service which, as far as I know, is not queried by NuGet.CatalogReader at all today.

If we want to stick to "official" APIs, we can only query for packages owned by an arbitrary user and not the reverse (which I think is just fine in this scenario):
https://api-v2v3search-0.nuget.org/query?q=owner:nuget&semVerLevel=2.0.0&prerelease=true

Use q=owner:OWNER to search for packages owned by a specific user. semVerLevel=2.0.0 means we don't want to filter out SemVer 2.0.0 packages. prerelease=true means we don't want to filter out packages that only have prerelease versions. Use skip and take to page through results. take defaults to 20 and has a maximum value of 1000.
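As a rough illustration of the paging described above, here is a minimal Python sketch. The endpoint and query parameters are the ones quoted in this comment; the function names, the injected `fetch_json` callable, and the assumption that the response is a JSON object with a `data` array whose items carry an `id` field are illustrative and should be checked against the live service.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode

# Endpoint quoted in the comment above.
SEARCH_URL = "https://api-v2v3search-0.nuget.org/query"
MAX_TAKE = 1000  # maximum page size mentioned above


def owner_query_url(owner: str, skip: int = 0, take: int = MAX_TAKE) -> str:
    """Build the search URL for one page of packages owned by `owner`."""
    params = {
        "q": f"owner:{owner}",
        "semVerLevel": "2.0.0",   # do not filter out SemVer 2.0.0 packages
        "prerelease": "true",     # keep packages with only prerelease versions
        "skip": skip,
        "take": take,
    }
    # safe=':' keeps "owner:nuget" readable instead of "owner%3Anuget".
    return f"{SEARCH_URL}?{urlencode(params, safe=':')}"


def iter_owned_ids(
    owner: str,
    fetch_json: Callable[[str], dict],
    take: int = MAX_TAKE,
) -> Iterator[str]:
    """Yield package IDs for `owner`, paging with skip/take.

    `fetch_json` is a hypothetical callable that GETs a URL and returns
    the parsed JSON body. The "data"/"id" shape is an assumption.
    """
    skip = 0
    while True:
        page = fetch_json(owner_query_url(owner, skip, take))
        items = page.get("data", [])
        for item in items:
            yield item["id"]
        if len(items) < take:  # a short page means there are no more results
            break
        skip += len(items)
```

Passing the HTTP call in as `fetch_json` keeps the paging logic independent of any particular HTTP client and makes it easy to exercise with canned responses.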

This query could be used by NuGet.CatalogReader to enumerate all package IDs owned by a user that you care about (e.g. microsoft). It would be straightforward to build out an IDictionary<string, string[]> at the beginning of runtime that maps an owner to a list of owned package IDs. This could be used during the catalog reading to filter out packages that we don't care about.
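To make the owner map idea concrete, here is a small Python sketch of the same structure (a dict standing in for the `IDictionary<string, string[]>`). The `list_owned_ids` callable and the shape of a catalog entry (a dict with `id` and `version` keys) are hypothetical placeholders, not NuGet.CatalogReader's actual types.

```python
from typing import Callable, Dict, Iterable, List, Set


def build_owner_map(
    owners: Iterable[str],
    list_owned_ids: Callable[[str], Iterable[str]],
) -> Dict[str, List[str]]:
    """Map each owner to the package IDs they own.

    `list_owned_ids` is a hypothetical callable, for example one that
    pages through the owner search query.
    """
    return {owner: sorted(list_owned_ids(owner)) for owner in owners}


def wanted_ids(owner_map: Dict[str, List[str]]) -> Set[str]:
    """Flatten the map into a lookup set.

    NuGet package IDs are case-insensitive, so compare in lowercase.
    """
    return {pid.lower() for ids in owner_map.values() for pid in ids}


def filter_entries(entries: Iterable[dict], wanted: Set[str]) -> List[dict]:
    """Keep only catalog entries whose ID is in the wanted set.

    The entry shape ({"id": ..., "version": ...}) is an assumption.
    """
    return [entry for entry in entries if entry["id"].lower() in wanted]
```

Because ownership is per package ID, the lookup set stays small even when the catalog contains many versions of each package.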

One caveat is that ownership can change for existing packages. That means that a package that you have already excluded may become interesting again even after your catalog cursor has passed it.
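One possible way to handle this caveat, sketched in Python under the assumption that the owner map is rebuilt on every run and the previous one is kept around: diff the two maps and treat any newly owned ID as needing a backfill that ignores the catalog cursor. The function name and the idea of persisting the previous map are illustrative only.

```python
from typing import Dict, List, Set


def newly_owned(
    previous: Dict[str, List[str]],
    current: Dict[str, List[str]],
) -> Set[str]:
    """Return package IDs present in the current owner map but not the previous one.

    These IDs may have been skipped while the cursor advanced, so they
    would need to be fetched again regardless of the cursor position.
    IDs are compared in lowercase because NuGet IDs are case-insensitive.
    """
    before = {pid.lower() for ids in previous.values() for pid in ids}
    after = {pid.lower() for ids in current.values() for pid in ids}
    return after - before
```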

Alternatively, we could implement a different mode that does not use the catalog at all and simply goes straight to flat container after discovering an owner's list of package IDs. This mode could be augmented by transitively fetching each of the package's dependencies. For example, if you only care about packages owned by microsoft and aspnet, you probably still want Newtonsoft.Json owned by jamesnk.
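The transitive fetch in this alternative mode amounts to a graph traversal. Below is a minimal Python sketch using breadth-first search; `get_dependencies` is a hypothetical callable that returns the dependency IDs of a package (for example, the union across all of its versions), and how it would obtain them is left open.

```python
from collections import deque
from typing import Callable, Iterable, Set


def dependency_closure(
    seed_ids: Iterable[str],
    get_dependencies: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Return the seed package IDs plus everything they depend on, transitively.

    `get_dependencies` is a hypothetical lookup; IDs are normalized to
    lowercase because NuGet IDs are case-insensitive.
    """
    seen = {pid.lower() for pid in seed_ids}
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for dep in get_dependencies(current):
            key = dep.lower()
            if key not in seen:  # the seen set also guards against cycles
                seen.add(key)
                queue.append(key)
    return seen
```

Starting from the IDs owned by microsoft and aspnet, a traversal like this would pull in Newtonsoft.Json as soon as any seed package lists it as a dependency.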

"if you only care about packages owned by microsoft and aspnet, you probably still want Newtonsoft.Json owned by jamesnk": yes, that is also correct. Dependencies are probably less sensitive to versions, so I was assuming that all dependencies had been resolved at least once and Newtonsoft.Json would already be present. In retrospect, it would be better to implement a sort of recursive download in this "different mode": if package1 is made by author X, we want all of author X's packages, and package1 depends on author Y's package2, then it is better to "trust" all of Y's packages and download them all. That would create a somewhat "trusted offline NuGet repo" based on one or more known authors and the authors of their dependencies.