Expose the METADATA file of wheels in the simple API
dstufft opened this issue · 124 comments
Currently a number of projects are trying to work around the fact that, in order to resolve dependencies in Python, you have to download the entire wheel just to read the metadata. I am aware of two current strategies for working around this: one is attempting to use the PyPI JSON API (which isn't a good solution because it's non-standard, the data model is wrong, and it's not going to be secured by TUF), and the other is attempting to use range requests to fetch only the `METADATA` file from the wheel before downloading the entire wheel (which isn't a good solution because TUF can currently only verify entire files, and it depends on the server supporting range requests, which not every mirror is going to support).
It seems to me like we could sidestep this issue by simply having PyPI extract the `METADATA` file of a wheel as part of the upload process, and storing it alongside the wheel itself. Within TUF we can ensure that these files have not been tampered with by simply storing each one as another TUF-secured target. Resolvers could then download just the metadata file for a wheel they're considering as a candidate, instead of having to download the entire wheel.
This is a pretty small delta over what already exists, so it's more likely we'll actually get it done than any of the broader proposals to design an entire, brand-new repository API, or to also retrofit the JSON API inside of TUF.
The main problems with it are that the `METADATA` file might be larger than needed, since it contains the entire long description of the wheel, and that it still leaves sdists unsolved (but they're not currently really solvable anyway). I don't think either problem is too drastic though.
What do folks think? This would probably require a PEP, and I probably don't have the spare cycles to do that right now, but I wanted to get the idea written down in case someone else felt like picking it up.
@pypa/pip-committers @pypa/pipenv-committers @sdispater (not sure who else works on poetry, feel free to CC more folks in).
Sounds like a good idea. It would probably need to be an optional feature of the API, as we need to keep the spec backward-compatible, and really basic "serve a directory over HTTP" indexes might not be able to support the API.
But that is a minor point. Basically +1 from me.
One pip-specific note on timing, though. It looks like pip will get range request based metadata extraction before this API gets formalised/implemented. That's fine, but I think that when this does become available, pip should drop the range approach and switch to just using this. That would be a performance regression for indexes that support range requests but not the new API, but IMO that's more acceptable than carrying the support cost for having both approaches.
I agree, it seems like a reasonable solution. If we design how the metadata is listed carefully, it’d likely also be reasonable for the files-in-a-directory use case to optionally implement.
What would the filename of this file be? Something like `pip-20.1.1-py2.py3-none-any.whl.METADATA`?
Trying to think of alternatives: since METADATA is already RFC 822 compliant, we could include the metadata as headers on the response to requests for `.whl` files. Clients that only want the metadata could call `HEAD` on the URL; clients that want both the metadata and the `.whl` file itself would call `GET` and get both in a single request. This would be a bit more challenging for PyPI to implement, though.
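A minimal sketch of how a client would use that (ultimately not adopted) header-based approach, assuming the index exposed metadata fields as response headers; the URL and the `Requires-Dist` header name are placeholders, not anything that exists today:

```python
import requests

WHEEL_URL = "https://example.invalid/packages/pip-20.1.1-py2.py3-none-any.whl"  # placeholder

# Metadata only: a HEAD request returns the headers without the wheel body.
resp = requests.head(WHEEL_URL)
requires_dist = resp.headers.get("Requires-Dist")  # hypothetical header name

# Metadata plus the wheel itself: a single GET returns both.
resp = requests.get(WHEEL_URL)
wheel_bytes = resp.content
```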
It would also be more challenging for mirrors like bandersnatch to implement, since they don't have any runtime component where they could add those headers, but the bigger thing is that headers can't be protected by TUF, and we definitely want this to be TUF protected.
The other option would be to embed this inside the TUF metadata itself, which is a JSON doc and has an area for arbitrary metadata to be added. However, I think that's worse for us since it's a much larger change in that sense, and it sort of violates the separation of concerns we currently have with TUF.
As far as the file name goes, I don't really have a strong opinion on it. Something like `pip-20.1.1-py2.py3-none-any.whl.METADATA` works fine for me; there's a very clear marker for what file the metadata belongs to, and in the "serve a directory over HTTP" index, they could easily add that file too.
Got it, I wasn't thinking that TUF couldn't protect headers but that makes sense in retrospect.
I don't see any significant issues with the proposal aside from the fact that PyPI will finally need to get into the business of unzipping/extracting/introspecting uploaded files. Do we think that should happen during the request (thus guaranteeing that the `METADATA` file is available immediately after upload, but potentially slowing down the request) or can it happen outside the request (by kicking off a background task)?
Within the legacy upload API we will probably want to do it inline? I don't know, that's a great question for whoever writes the actual PEP to figure out the implications of either choice 😄 . #7730 is probably the right long term solution to that particular problem.
Alternatively it might be nice to provide the entire *.dist-info directory as a separable part. Or, going the other direction, METADATA without long-description. Of course it can be different per each individual wheel.
I thought about the entire `.dist-info` directory. If we did that we would probably want to re-zip it into a single artifact. It just didn't feel super worthwhile to me, as I couldn't think of a use case for accessing files other than `METADATA` as part of the resolution/install process, which is all this idea really cared about. Maybe there's something I'm not thinking about though?
Agreed, anything other than `METADATA` feels like YAGNI. After all, the only standardised files in `.dist-info` are `METADATA`, `RECORD` and `WHEEL`. `RECORD` is not much use without the full wheel, and there's not enough in `WHEEL` to be worth exposing separately.
So unless there's a specific use case, like there is for `METADATA`, I'd say let's not bother.
Off the top of my head, the entry points are the most interesting metadata not in `METADATA`.
Are we expecting to backfill metadata for a few versions of popular projects, particularly those that aren't released often?
> What do folks think?
I quite like it. :)
> `pip-20.1.1-py2.py3-none-any.whl.METADATA`
👍 I really like that this makes it possible for static mirrors to provide this information! :)
> not sure who else works on poetry, feel free to CC more folks in
My main concern is the same as @ofek -- how does this work with existing uploads? Would it make sense for PyPI to have a "backfill when requested" approach for existing uploads?
I think we'd just backfill this for every `.whl` distribution that has a `METADATA` file in it?
> and there's not enough in `WHEEL` to be worth exposing separately
In pip at least we extract and parse `WHEEL` first, to see if we can even understand the format of the wheel. In a future where we actually want to exercise that versioning mechanism, if we make `WHEEL` available from the start then we can avoid considering new wheels we wouldn't be able to use. If we don't take that approach, then projects may hesitate to release new wheels because it would cause users' pips to fully resolve then backtrack (or error out) when encountering a new wheel once downloaded.
> It seems to me like we could sidestep this issue by simply having PyPI extract the `METADATA` file of a wheel as part of the upload process, and storing it alongside the wheel itself.
Great idea! In fact, we should be able to list this `METADATA` as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678
> Great idea! In fact, we should be able to list this `METADATA` as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678
Yep! This should be doable, as long as it's part of (or relationally connected to) the `Release` or `File` models.
What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the `METADATA` file?
> What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the `METADATA` file?
Yep, exactly. We wouldn't need the `METADATA` filename itself stored, assuming that it can be inferred (i.e. that it's always `{release_file}.METADATA`).
Cross-post of a proposition I made on discuss.python.org:
In short: we extend the concept of the `data-requires-python` attribute to cover all the necessary metadata for pip's dependency resolution.
(There are some more details in the post over there, and in #8733)
So that is a different take on this issue. I opened a dedicated ticket in case there is enough interest for further discussion: #8733
[note from @pradyunsg: I've trimmed content duplicated here and in #8733, about the proposed design. This is to prevent this proposal from taking over the discussion here]
> This would probably require a PEP
Looks like I never said this explicitly, but yes, I definitely think this needs a PEP.
I think so, yes. Someone needs to decide how to include a link to the metadata file in the simple API (in a backward compatible way). I'd suggest following the GPG signature approach:
> If there is a metadata file for a particular distribution file, it MUST live alongside that file with the same name, with `.metadata` appended to it. So if the file `/packages/HolyGrail-1.0-py3-none-any.whl` existed and had associated metadata, the metadata would be located at `/packages/HolyGrail-1.0-py3-none-any.whl.metadata`. Metadata MAY be provided for any file type, but it is expected that package indexes will only provide it for wheels, in practice.
A PEP for that should be reasonably straightforward and non-controversial. I assume @dstufft would be PEP delegate and could approve it reasonably easily.
After that, it's a case of
- Adding support for the new data into PyPI. That's probably the hardest bit.
- Adding support to clients like pip.
It might be nice to also expose that data via the JSON API, seeing as we're going to have it available - but that's a separate question. Let's get the basic feature in place first!
By the way, if we can get this available sooner rather than later, it would save me running any more of my job that's downloading 7TB of wheels from PyPI just to extract the metadata file 🙂
Don't worry @ewdurbin - I promise I won't actually run it for all of those files, I'll pass on the hundreds of 800MB PyTorch wheels, for a start!
(@pfmoore this is a bit off topic, but I wrote some code a few months ago that leverages HTTP Range requests and the zip format to minimize the number of bytes you need to extract metadata from wheels retrieved through HTTP :) https://github.com/python-poetry/poetry/pull/1803/files#diff-5a449c4763ca5e9acfa4dbb6bc875866981cb3d8ca12121ae78035faa16999b2R525 . If it helps you in any way, I'm glad!)
And pip has similar code too, behind fast-deps. :)
Also, I'll note that I was thinking of adding a `data-has-static-metadata` attribute to the link tag, to denote that it has static metadata available.
I’ll spend some time writing a draft in the coming days unless there’s already someone doing it (doesn’t seem to be the case from recent comments?)
> adding a `data-has-static-metadata` to the link tag

What would be the advantage of this? If we follow Paul’s naming proposal above, whether a link has static metadata can be canonically detected by searching for an `<a>` tag containing `f"{filename}.metadata"`.
I'm pretty strongly inclined to only expose what there's a clearly established need for (i.e. the metadata file). It's easier to add more later than to remove something once it's added.
> Also, I'll note that I was thinking of adding a `data-has-static-metadata` attribute to the link tag, to denote that it has static metadata available.
At this point, it might be worth adding the METADATA file as a link data attribute (whole or relevant parts, base-64 encoded or whatever).
> adding the METADATA file as a link data attribute (whole or relevant parts, base-64 encoded or whatever).
That would create huge index pages and will probably make things slower in practice. The metadata file contains free-form fields like `Description`, and I know there are projects with super long content in there (because they include their super long README). And the index would then include METADATA from every single version ever published, times every platform wheel for each version.
Even just including the relevant parts could be wasteful if there are more than a few releases in a project. You'd be throwing away most of them most of the time since it's very rare the resolver would ever need all of them.
> It's easier to add more later than to remove something once it's added.
This brings out a consideration on the entry’s name on the index though: `f"{filename}.metadata"` would make it more difficult to include more files if we ever decide to. I’m currently leaning toward specifying `METADATA` by its “full” name instead, e.g. `distribution-1.0-py3-none-any.whl/distribution-1.0.dist-info/METADATA`.
> That would create huge index pages [...] the index would then include METADATA from every single version ever published, times every platform wheel for each version.

> You'd be throwing away most of them most of the time since it's very rare the resolver would ever need all of them.
Agreed. There would be some waste for sure.
[Thinking out loud... I don't know what should be optimized: number of requests or size of payloads. I have no idea what those numbers are, where the critical point is. Newer protocols seem to make it relatively painless to make multiple requests.]
I’ve drafted a PEP: python/peps#1955. I think I’ll need a sponsor and a PEP delegate (can they be the same person?)
Update: This is now PEP 658 https://www.python.org/dev/peps/pep-0658/
And now the PEP was merged 🎉
Is the implementation up for grabs, or do you plan to tackle it @uranusjr?
Also, trying to surface as many design decisions as possible early to help whoever will implement:
- Is Warehouse expected to:
  - store the metadata file on S3 / something?
  - store the text contents in the database?
  - extract the file on request from the wheel (and make sure we have a very long CDN cache)?
- What is the expected content type of a Metadata file? (`text/plain`?)
The PEP is currently being discussed at https://discuss.python.org/t/8651.
Anything that isn't specific to Warehouse itself should probably be discussed there.
Ah, sorry, got confused 😅
(I’m limiting the response to Warehouse-specific; others should be posted in the Discourse thread for visibility.)
> Is the implementation up for grabs, or do you plan to tackle it?
It’s up for grabs. I plan to take a look into it only if nobody does anything and I can’t take it anymore. My threshold for this kind of thing is pretty high, so I suggest someone else work on this if they really want this to happen sooner rather than later 🙂
> Is Warehouse expected to:
> - store the metadata file on S3 / something?
> - store the text contents in the database?
> - extract the file on request from the wheel (and make sure we have a very long CDN cache)?
I haven’t thought that far tbh. I’d probably store the file separately on S3 if I have to implement it right now (I believe that’s how the GPG key is stored?) but I’m not the right person to make the decision either way.
> What is the expected content type of a Metadata file? (`text/plain`?)
Good question! I almost wanted to write it in the PEP but decided otherwise, since it doesn’t really matter (PEP 503 didn’t specify the content type for wheels either) and I don’t know the answer. I guess `text/plain` is good enough. Feel free to ask this in the Discourse thread if you feel the PEP should explicitly specify one.
> Feel free to ask this in the Discourse thread if you feel the PEP should explicitly specify one.
(not sure this needs to be in the PEP, but this will need to be in the implementation somehow)
It would help a lot in understanding how this API works if anybody could provide a Jupyter or Observable notebook with a complete example of accessing the API for fetching dependency information.
The client-side logic is basically
import hashlib

def find_metadata(anchor):
    """Fetch distribution metadata with PEP 658."""
    try:
        metadata_validation = anchor.attrs["data-dist-info-metadata"]
    except KeyError:
        raise PEP658NotAvailable()
    if metadata_validation == "true":
        # Skip validation.
        hashname = hashvalue = None
    else:
        hashname, sep, hashvalue = metadata_validation.partition("=")
        if not sep:
            raise InvalidDataDistInfoMetadataAttr()
        if hashname not in hashlib.algorithms_available:
            raise HashAlgorithmUnsupported()
    metadata_url = f"{anchor.attrs['href']}.metadata"
    metadata_content = _fetch_metadata(metadata_url)  # Download metadata with HTTP.
    if hashname is None:
        return metadata_content
    if hashlib.new(hashname, metadata_content).hexdigest() != hashvalue:
        raise HashMismatch()
    return metadata_content

# Find the `<a>` tag representing the file you want on a PEP 503 index page.
# You need to figure out this part yourself; it is out of scope of this issue and PEP 658.
anchor = _find_distribution_i_want()
metadata_content = find_metadata(anchor)
# Use a compatible parser (e.g. email.parser) to parse the metadata.
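A small follow-up sketch for that last step, assuming `metadata_content` holds the raw bytes returned by `find_metadata` above (standard library only):

```python
import email

msg = email.message_from_bytes(metadata_content)
name = msg["Name"]
version = msg["Version"]
requires_dist = msg.get_all("Requires-Dist") or []  # absent for projects with no dependencies
```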
BTW PEP 658 has been accepted, so anyone interested please feel free to proceed on an implementation.
> BTW PEP 658 has been accepted, so anyone interested please feel free to proceed on an implementation.
Was there an attempt to reduce the length of the param? Looks like the proposed scheme is to add it to the anchor on each line in https://pypi.org/simple/ like this:
- <a href="/simple/abcmeta/">abcmeta</a>
+ <a href="/simple/abcmeta/#data-dist-info-metadata=sha3-256=a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a">abcmeta</a>
Which will increase the size from the current 17M to about 60M.
Instead of layering more hacks on top of existing hacks, may I propose creating a `/simple.csv` endpoint which would host an extensible CSV with a header:
name, metahashtype, metahashvalue
abcmeta, sha3-256, a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a
name | metahashtype | metahashvalue |
---|---|---|
abcmeta | sha3-256 | a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a |
Then anybody could get the info without a custom HTML and anchor parser. That would be an easier implementation on both the client and the server side.
I would also consider using a single hashing function to reduce possible duplication when implementing distributed PyPI with content addressing later.
> Looks like the proposed scheme is to add it to the anchor on each line in https://pypi.org/simple/ like this.
According to the PEP it's added to the project page (e.g. https://pypi.org/simple/pydash) rather than the index page. It would look something like this (assuming I've interpreted the PEP correctly).
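A hedged illustration of such an entry on the per-project page (the filename and version are hypothetical, and the hash values are placeholders; the `data-dist-info-metadata` attribute format follows PEP 658):

```html
<!-- On https://pypi.org/simple/pydash/ -->
<a href="https://files.pythonhosted.org/packages/.../pydash-5.0.0-py3-none-any.whl"
   data-dist-info-metadata="sha256=...">pydash-5.0.0-py3-none-any.whl</a>
<!-- The metadata file itself is served at the href with ".metadata" appended. -->
```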
@domdfcoding is correct. The PEP describes the per-project index pages.
@pradyunsg if it is not possible to clarify in the title that the PEP is only for project pages, then it would save me quite a bit of time if that was specified in the abstract.
Without the code from @uranusjr and @domdfcoding I am not sure that I can interpret the PEP correctly. Why are there no examples in the PEP itself?
Back to the implementation. I am interested in estimating the size of the dependency metadata, which, according to this chapter, may be too long a list to be included on the page.
Am I right that, for now, the `.metadata` file is not extracted or stored anywhere?
If that is the case, then:
- find the place in the upload pipeline where the `.metadata` should be extracted
- find the place where the `.metadata` can be stored
- write the code to extract the `.metadata` and store it
- add an endpoint to serve `.metadata` (I guess there is no handler, and it will just be a link to the CDN)
You would probably want to just mimic how files themselves are stored/served. I would also just store the metadata file alongside the artifact itself in the object store. Look at how PGP signatures are handled basically.
The question of extraction still needs to be solved. Under forklift is the legacy upload API. The simplest thing to do would be to extract it inline with the upload, but we'd need to see if that ends up being a massive drag on speed or memory in the uploader. If so, we might have to do it asynchronously.
An illustration of why this issue is so important: imagine the load on PyPI if every library did dependency resolution like that. That probably skews download stats a lot.
I can start by creating a reference notebook for fetching dependencies for a project like https://pypi.org/project/freemocap/ from warehouse. I'd prefer an alternative REST API that could be added instead of the simple one, but for now I need to know the command for extracting the METADATA file. If the upload is a wheel, which means a `.zip`, it is possible to stream-scan and extract it without loading it fully into memory.
> Under forklift is the legacy upload API.
@dstufft does that mean the API is going to be deprecated, but there is no other API yet? I am looking at the code right now to see where the upload happens.
The PyPI simple API is a REST api FWIW, it just has a serialization format that isn't great for complex data.
The plan was to, at some point, come up with a better upload API, ideally one that could work asynchronously. However, it remains to be seen if that ever actually happens.
So the https://warehouse.readthedocs.io/api-reference/legacy.html#upload-api is the only upload API right now. Is that correct?
EDIT: Found it. `forklift` is the module that implements the upload API - https://warehouse.pypa.io/application.html#file-and-directory-structure
Interesting that during upload there is a form that is checking the metadata. And that metadata is not coming from a wheel. It even seems to include dependencies. Is it stored in the database?
The file processing in upload starts from https://github.com/pypa/warehouse/blob/3943226cf1168f5cead40913d42603e9d1f25010/warehouse/forklift/legacy.py#L1172-L1173
The actual extraction of the METADATA can be made together with calculating hashes in https://github.com/pypa/warehouse/blob/3943226cf1168f5cead40913d42603e9d1f25010/warehouse/forklift/legacy.py#L1215-L1217
There is code that can be potentially reused from `pip` for selective extraction from zip streams: https://github.com/pypa/pip/blob/df98167fe54b99f61dd000ae619fd4b1c634c039/src/pip/_internal/network/lazy_wheel.py
Lazy-reading requires random access and I don’t think you can easily use it with the streaming approach used in forklift. It’s much easier to just re-open `temporary_filename` in read mode afterwards and pass it to `zipfile.ZipFile`.
It appears that `ZipFile` doesn't support scanning for filenames in a `.zip` stream. It does a direct lookup into the central directory at the end of the file: https://github.com/python/cpython/blob/720aef48b558e68c07937f0cc8d62a60f23dcb3d/Lib/zipfile.py#L1256-L1257 I don't want to write my own scanner, so opening a `ZipFile` on the existing `temporary_file` is indeed a much easier way.
The code for extraction is almost complete. I just need to find a reliable way to calculate the name of the `.dist-info` dir where the `METADATA` file is located. For `vodka-3.1.0-py3-none-any.whl` this would be `vodka-3.1.0.dist-info/METADATA`.
Generally, I search for the file in the wheel whose name ends with `.dist-info/METADATA`. In theory, it should be `<normalised project name>-<version>.dist-info/METADATA`, but a lot of projects don't correctly normalise the name in the wheel, so it's safer to not require that. I'd recommend failing if there's more than one file matching `*.dist-info/METADATA`, too.
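A minimal sketch of that approach, assuming the uploaded wheel has already been written to a temporary file as discussed above (the function and variable names here are placeholders, not actual Warehouse code):

```python
import zipfile

def extract_metadata(temporary_file):
    """Return the raw METADATA bytes from a wheel, failing if the match is ambiguous."""
    with zipfile.ZipFile(temporary_file) as wheel:
        # Only consider top-level *.dist-info/METADATA entries (exactly one "/" in the name);
        # restricting to top level is an extra heuristic on top of the suffix match.
        candidates = [
            name for name in wheel.namelist()
            if name.count("/") == 1 and name.endswith(".dist-info/METADATA")
        ]
        if len(candidates) != 1:
            raise ValueError(f"expected exactly one *.dist-info/METADATA, found {candidates!r}")
        return wheel.read(candidates[0])
```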
#9972 is ready for review once somebody approves tests to be run.
@pfmoore I used `{distribution}-{version}.dist-info/METADATA` for the lookup. It should be safe, because everything else is an invalid wheel according to the spec - https://www.python.org/dev/peps/pep-0427/#file-name-convention However, I don't mind if somebody can show me how to check the contents of all the latest wheels from PyPI without downloading them on a 25 Mbps connection.
There are no patches to the simple API. That is better addressed in a separate request.
I think using `wheel_info.group("namever")` should be good enough for now. We can always improve that after that info is exposed to the API; it’d be way easier to check when that’s done.
Fixed tests for #9972. Can somebody give approval to GitHub Actions to run them?
The #9972 wheel needs another kick. Sorry for the noise, hopefully it will be over soon. :D
Please drop comments on the PR, instead of notifying everyone on this issue.
PEP 658 Implementation is now merged. Wheels uploaded after it has deployed will have associated .metadata files served alongside them per spec.
We don't currently have a backfill plan, but will consider it after seeing how pips that support 658 react to the new stuff :)
Yay! I think that pip has supported this since 22.3 (pypa/pip#11111) although obviously it won't have got much testing yet. I don't know if there's any data we can usefully collect - I guess PyPI download stats on the new metadata files will be a good indication, though.
Extremely rough sketch that proves out a (hopefully) lightweight way of backfilling:
from pathlib import Path

from pip._internal.network.lazy_wheel import dist_from_wheel_url
from pip._internal.network.session import PipSession

from warehouse.packaging.models import File

session = PipSession()

for file in db.query(File).filter(File.packagetype == "bdist_wheel").filter(File.metadata_file_sha256_digest == None).yield_per(100):
    metadata = dist_from_wheel_url(file.release.project.name, f"https://files.pythonhosted.org/packages/{file.path}", session)._dist._files[Path('METADATA')]
    print(f'processed {file}')
Just a note that PyPI now also supports PEP 714 (alongside PEP 658), which fixed a bug in the spec.
Should this issue be closed, or kept open to track backfilling?
> Should this issue be closed, or kept open to track backfilling?
My 2 cents: Given that #13705 tracked the changes that were needed for PEP 714, let's use this issue to track the backfilling.
> PEP 658 Implementation is now merged. Wheels uploaded after it has deployed will have associated .metadata files served alongside them per spec.
>
> We don't currently have a backfill plan, but will consider it after seeing how pips that support 658 react to the new stuff :)
What are the next steps for backfilling?
Now that Warehouse has been generating the `.metadata` files since ~May 2023, and pip has supported them since pip 23.2 (where the parsing fixes + PEP 714 changes landed), which was released 2023-07-15 - I presume we're happy to say that there's been sufficient testing to confirm everything is working as intended, and as such there is no harm in backfilling?
@edmorley Essentially a PyPI admin needs to run the script in #8254 (comment) (plus add functionality to actually write the data to PyPI's object store)
Anything someone like me can do to help make this happen?
@brettcannon You can review #15368 if you'd like! 🙂
OK, the backfill is in progress. We have about 4M wheels to backfill metadata for, which I currently expect to take about 30 days, working from the newest wheel to the oldest.
Out of curiosity, what is the bottleneck?
No bottleneck, that's just how we have it configured to run.
Anyone in this thread have thoughts about what should happen with 'invalid' wheels like this one? When pip tries to extract the metadata from this, we get:
UnsupportedWheel: scipy1101 has an invalid wheel, scipy1101 has an invalid wheel, .dist-info directory 'scipy-1.10.1.dist-info' does not start with 'scipy1101'
I think you should only search for the suffix: https://github.com/pypa/hatch/blob/hatch-v1.9.3/src/hatch/index/publish.py#L19-L51
> Anyone in this thread have thoughts about what should happen with 'invalid' wheels like this one? When pip tries to extract the metadata from this, we get:
>
> UnsupportedWheel: scipy1101 has an invalid wheel, scipy1101 has an invalid wheel, .dist-info directory 'scipy-1.10.1.dist-info' does not start with 'scipy1101'
Don't allow them to be uploaded in the first place.
@di skipping seems reasonable. It’s unfortunate, because consumers will have to fall back to downloading to discover the wheel is bad, but the standard doesn’t have a way to mark bad wheels.
@abitrolly For the future, maybe, but we have to deal with what was uploaded in the past.
@ofek that’s not what the spec says, and I know for a fact that if you do that you’ll hit problems with wheels containing multiple `.dist-info` directories (and the one you want isn’t necessarily the first one…)
@pfmoore what is the process for dealing with malicious wheels? Why can it not be applied here?
They aren't technically malicious, just malformed under the current spec (I don't know if they were valid under older versions of the spec without checking). The code in the wheel is itself presumably perfectly correct and usable. At least it will be in the example @di quoted. And with 4M wheels to backfill, a manual process[^1] isn't going to scale, even if only a tiny proportion are failing.

[^1]: Which the process for malicious wheels will definitely be.
@pfmoore if the wheel works and `pip` can install it and extract the metadata, why can the backfill job not do the same? Empty or partial metadata with hacks is better than maintaining new logic with data structures, which will crawl into the API and then require workarounds in tools.
> if the wheel works and pip can install it and extract the metadata, why can the backfill job not do the same?
I didn't say pip can install it - I haven't checked that. All I'm saying is that removing existing wheels from PyPI isn't something we can just do in an automated fashion, and the scale of the problem makes it impractical to do so manually. Skipping existing wheels that can't be backfilled is a simple, practical solution. For new wheels, if the metadata can't be extracted I'd expect PyPI to reject the upload, so it's only a problem with the occasional older wheel. And the backfill data is only an optimisation in any case.
I think we've probably spent more time debating the issue than it warrants, TBH.
Yeah, the good old times when you could endlessly discuss optimizations are over. It is all about busyness now.
> @ofek that’s not what the spec says, and I know for a fact that if you do that you’ll hit problems with wheels containing multiple .dist-info directories (and the one you want isn’t necessarily the first one…)
If you're talking about this living document then I think the spec does not require specific naming for the metadata directory. From experience, one cannot rely on `'-'.join(wheel_name.split('-')[:2])` being an exact match for that internal directory. Originally I had that as the logic, but backends do weird things and users reported being unable to upload some wheels, which is why I changed to the logic that I sent you.
This is also what other tools like twine do, which makes sense because the wheels had to end up on PyPI somehow...
See https://packaging.python.org/en/latest/specifications/binary-distribution-format/#file-contents:

> `{distribution}-{version}.dist-info/` contains metadata.
Agreed, not all wheels you'll encounter on PyPI follow that spec, but I believe that's mostly legacy issues. I think (hope!) all current backends follow the spec. And if they don't, it's a bug in the backend.
My experience comes from writing code that read all wheels from PyPI and extracted metadata, so precisely this use case. For that, I did have the sort of logic you describe, but it gave me a number of rather serious silent failures - projects whose extracted metadata had a different project name (not just normalised differently, but completely wrong!) for example. I tightened up the logic, but I still couldn't be sure my heuristics were always giving correct data.
In this situation, where it is far better to get no metadata extracted than to get incorrect metadata, I stand by my statement that skipping anything that doesn't match the spec is the right way to go. Don't forget, pip (and other tools) will use the metadata to skip downloading the full file if possible, so bad metadata could render a perfectly valid file unusable.
Anyway, if we skip, we can do another pass later to try a bit harder to extract the correct metadata. If we make an attempt now and it's wrong, we have a big problem even finding which files need re-checking.
The backfill seems to be going smoothly so I bumped the rate up to 2000 every 5 minutes or roughly 0.5M a day. There is currently 3.9M outstanding so I expect this to complete in about a week now.
Of the >100K files we've already processed on PyPI, only 3 have been flagged as having invalid metadata:
warehouse=> select filename from release_files where metadata_file_unbackfillable is True;
filename
------------------------------------------------------------
MNIST_dir-0.2.0-py3-none-any.whl
MNIST_dir-0.2-py3-none-any.whl
pyg_library-0.2.0.dev20230413+pt113cpu-cp310-cp310-any.whl
(3 rows)
If that rate continues, I think the plan of just skipping these indefinitely seems fine.
> Of the >100K files we've already processed on PyPI, only 3 have been flagged as having invalid metadata:
I thought in your earlier message you'd said one of the problem files was scipy1101 - did I misunderstand?
> if the wheel works and pip can install it and extract the metadata, why can the backfill job not do the same?
Mentioning this for posterity and not for inviting continued discussion: pip will refuse to install such a wheel, so the "if" here evaluates to False.
We ran up against ratelimiting issues with our object storage backend, so the task did not remain at a rate of 2000 / 5 minutes for very long. Currently at 700 / 5 minutes, I will periodically bump that as we try to find the (undocumented) ratelimit.
> the rate up to 2000 every 5 minutes or roughly 0.5M a day. There is currently 3.9M outstanding so I expect this to complete in about a week now.
https://www.wolframalpha.com/input?i=3.9+million+%2F+%282000+%2F+hour%29+in+days
Doing the math on this, I'm getting >80 days rather than about a week? Moving the rate down to 700/hr makes it >250 days.
Ah, I was doing my math wrong and it's 2000 per 5 minutes, not 2000 per hour. :)
At 700 / (5 minutes) it will take less than 20 days: https://www.wolframalpha.com/input?i=3.9+million+%2F+%28700%2F+%285+minutes%29%29+in+days
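(Working that out from the figures above: 700 per 5 minutes is 8,400 per hour, or 201,600 per day; 3.9 million ÷ 201,600 per day ≈ 19.3 days.)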
100th comment!
With about 1.6M packages remaining, current forecast shows the task should be complete by March 4th.