pypi/warehouse

Expose the METADATA file of wheels in the simple API

dstufft opened this issue · 124 comments

Currently a number of projects are trying to work around the fact that in order to resolve dependencies in Python you have to download the entire wheel in order to read the metadata. I am aware of two current strategies for working around this, one is the attempt to use the PyPI JSON API (which isn't a good solution because it's non standard, the data model is wrong, and it's not going to be secured by TUF) and the other is attempting to use range requests to fetch only the METADATA file from the wheel before downloading the entire wheel (which isn't a good solution because TUF can currently only verify entire files, and it depends on the server supporting range requests, which not every mirror is going to support).

It seems to me like we could side step this issue by simply having PyPI extract the METADATA file of a wheel as part of the upload process, and storing that alongside the wheel itself. Within TUF we can ensure that these files have not been tampered with, by simply storing it as another TUF secured target. Resolvers could then download just the metadata file for a wheel they're considering as a candidate, instead of having to download the entire wheel.

This is a pretty small delta over what already exists, so it's more likely we're going to get it done than any of the broader proposals of trying to design an entire, brand new repository API or by ALSO retrofitting the JSON API inside of TUF.

The main problems with it are that the METADATA file might be larger than needed, since it contains the entire long description of the wheel, and that it still leaves sdists unsolved (but they're not currently really solvable). I don't think either problem is too drastic though.

What do folks think? This would probably require a PEP and I probably don't have the spare cycles to do that right now, but I wanted to get the idea written down in case someone else felt like picking it up.

@pypa/pip-committers @pypa/pipenv-committers @sdispater (not sure who else works on poetry, feel free to CC more folks in).

Sounds like a good idea. It would probably need to be an optional feature of the API, as we need to keep the spec backward-compatible, and really basic "serve a directory over HTTP" indexes might not be able to support the API.

But that is a minor point. Basically +1 from me.

One pip-specific note on timing, though. It looks like pip will get range request based metadata extraction before this API gets formalised/implemented. That's fine, but I think that when this does become available, pip should drop the range approach and switch to just using this. That would be a performance regression for indexes that support range requests but not the new API, but IMO that's more acceptable than carrying the support cost for having both approaches.

I agree, it seems like a reasonable solution. If we design how the metadata is listed carefully, it’d likely also be reasonable for the files-in-a-directory use case to optionally implement.

di commented

What would the filename of this file be? Something like pip-20.1.1-py2.py3-none-any.whl.METADATA?

Trying to think of alternatives: since METADATA is already RFC 822 compliant, we could include the metadata as headers on the response to requests for .whl files. Clients that only want the metadata could call HEAD on the URL, clients that want both metadata and the .whl file itself would call GET and get both in a single request. This would be a bit more challenging for PyPI to implement, though.

It would also be more challenging for mirrors like bandersnatch to implement, since they don't have any runtime components where they could add those headers, but the bigger thing is that headers can't be protected by TUF, and we definitely want this to be TUF protected.

The other option would be to embed this inside the TUF metadata itself, which is a JSON doc and has an area for arbitrary metadata to be added. However, I think that's worse for us since it's a much larger change in that sense, and sort of violates the separation of concerns we currently have with TUF.

As far as file name, I don't really have a strong opinion on it. something like pip-20.1.1-py2.py3-none-any.whl.METADATA works fine for me, there's a very clear marker for what file the metadata belongs to, and in the "serve a directory over HTTP" index, they could easily add that file too.

di commented

Got it, I wasn't thinking that TUF couldn't protect headers but that makes sense in retrospect.

I don't see any significant issues with the proposal aside from the fact that PyPI will finally need to get into the business of unzipping/extracting/introspecting uploaded files. Do we think that should happen during the request (thus guaranteeing that the METADATA file is available immediately after upload, but potentially slowing down the request) or can it happen outside the request (by kicking off a background task)?

Within the legacy upload API we will probably want to do it inline? I don't know, that's a great question for whoever writes the actual PEP to figure out the implications of either choice 😄 . #7730 is probably the right long term solution to that particular problem.

Alternatively it might be nice to provide the entire *.dist-info directory as a separable part. Or, going the other direction, METADATA without long-description. Of course it can be different per each individual wheel.

I thought about the entire .dist-info directory. If we did that we would probably want to re-zip it into a single artifact. It just didn't feel super worthwhile to me as I couldn't think of a use case for accessing files other than METADATA as part of the resolution/install process, which is all this idea really cared about. Maybe there's something I'm not thinking about though?

Agreed, anything other than METADATA feels like YAGNI. After all, the only standardised files in .dist-info are METADATA, RECORD and WHEEL. RECORD is not much use without the full wheel, and there's not enough in WHEEL to be worth exposing separately.

So unless there's a specific use case, like there is for METADATA, I'd say let's not bother.

Off the top of my head the entry points are the most interesting metadata not in 'METADATA'

ofek commented

Are we expecting to backfill metadata for a few versions of popular projects, particularly those that aren't released often?

What do folks think?

I quite like it. :)

pip-20.1.1-py2.py3-none-any.whl.METADATA

👍 I really like that this makes it possible for static mirrors to provide this information! :)

not sure who else work on poetry, feel free to CC more folks in

@abn @finswimmer @stephsamson


My main concern is the same as @ofek -- how does this work with existing uploads? Would it make sense for PyPI to have a "backfill when requested" approach for existing uploads?

di commented

I think we'd just backfill this for every .whl distribution that has a METADATA file in it?

and there's not enough in WHEEL to be worth exposing separately

In pip at least we extract and parse WHEEL first, to see if we can even understand the format of the wheel. In a future where we actually want to exercise that versioning mechanism, if we make WHEEL available from the start then we can avoid considering new wheels we wouldn't be able to use. If we don't take that approach then projects may hesitate to release new wheels because it would cause users' pips to fully resolve then backtrack (or error out) when encountering a new wheel once downloaded.
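To make the WHEEL point concrete, here is a minimal sketch of the kind of compatibility check a client could run if the WHEEL file were exposed as well. It is not pip's actual code: `SUPPORTED_MAJOR` and `wheel_is_supported` are hypothetical names, and the sketch only assumes that WHEEL uses RFC 822 style headers with a `Wheel-Version` field.

```python
from email.parser import Parser

# Assumption: the highest major Wheel-Version this hypothetical client understands.
SUPPORTED_MAJOR = 1


def wheel_is_supported(wheel_file_text: str) -> bool:
    """Return True if the WHEEL file declares a major version we can handle."""
    headers = Parser().parsestr(wheel_file_text)
    version = headers.get("Wheel-Version")
    if version is None:
        return False  # Malformed WHEEL file: treat the wheel as unusable.
    try:
        major = int(version.strip().split(".")[0])
    except ValueError:
        return False
    return major <= SUPPORTED_MAJOR


example = (
    "Wheel-Version: 1.0\n"
    "Generator: bdist_wheel (0.37.1)\n"
    "Root-Is-Purelib: true\n"
    "Tag: py3-none-any\n"
)
print(wheel_is_supported(example))
```

With a check like this, a resolver could discard an unsupported wheel before resolving against it, rather than discovering the problem after download.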

It seems to me like we could side step this issue by simply having PyPI extract the METADATA file of a wheel as part of the upload process, and storing that alongside the wheel itself.

Great idea! In fact, we should be able to list this METADATA as yet another TUF targets file, and associate it with all of its wheels using custom targets metadata... @woodruffw @mnm678

Yep! This should be doable, as long as it's part of (or relationally connected to) the Release or File models.

What information do you need stored in the DB? In my head I just assumed it would get stored alongside the file in the object store. I guess probably the digest of the METADATA file?

Yep, exactly. We wouldn't need the METADATA filename itself stored, assuming that it can be inferred (i.e. that it's always {release_file}.METADATA).

Cross-post of a proposition I made on discuss.python.org:

In short: we extend the concept of the data-requires-python attribute to cover all the necessary metadata for pip's dependency resolution.

(There are some more details in the post over there, and in #8733)

So that is a different take on this issue. I opened a dedicated ticket in case there is enough interest for further discussion: #8733

[note from @pradyunsg: I've trimmed content duplicated here and in #8733, about the proposed design. This is to prevent this proposal from taking over the discussion here]

This would probably require a PEP

Looks like I never said this explicitly, but yes, I definitely think this needs a PEP.

FYI: PEP-643 (Metadata for Package Source Distributions) has been approved. 🚀

What are the next steps here @ewdurbin @dstufft @pfmoore? Writing a PEP to update the simple API standard, to allow inclusion of metadata files?

I think so, yes. Someone needs to decide how to include a link to the metadata file in the simple API (in a backward compatible way). I'd suggest following the GPG signature approach:

If there is a metadata file for a particular distribution file it MUST live alongside that file with the same name with a .metadata appended to it. So if the file /packages/HolyGrail-1.0-py3-none-any.whl existed and had associated metadata, the metadata would be located at /packages/HolyGrail-1.0-py3-none-any.whl.metadata. Metadata MAY be provided for any file type, but it is expected that package indexes will only provide it for wheels, in practice.
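As an illustration of the naming rule proposed above, a client could derive the metadata location from the distribution file URL roughly as follows. This is a sketch, not part of any specification text: `metadata_url` is a hypothetical helper, and dropping the URL fragment is an assumption based on index pages commonly carrying a `#<hash>=<value>` fragment that describes the wheel itself.

```python
from urllib.parse import urlsplit, urlunsplit


def metadata_url(file_url: str) -> str:
    """Append ``.metadata`` to the path of a distribution file URL."""
    parts = urlsplit(file_url)
    # Drop any fragment (e.g. "#sha256=..."): it describes the wheel,
    # not the metadata file.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path + ".metadata", parts.query, "")
    )


print(metadata_url("/packages/HolyGrail-1.0-py3-none-any.whl"))
```

Because the rule is purely a suffix on the file name, a static "serve a directory over HTTP" index can satisfy it by placing the extra file next to the wheel.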

A PEP for that should be reasonably straightforward and non-controversial. I assume @dstufft would be PEP delegate and could approve it reasonably easily.

After that, it's a case of

  1. Adding support for the new data into PyPI. That's probably the hardest bit.
  2. Adding support to clients like pip.

It might be nice to also expose that data via the JSON API, seeing as we're going to have it available - but that's a separate question. Let's get the basic feature in place first!

By the way, if we can get this available sooner rather than later, it would save me running any more of my job that's downloading 7TB of wheels from PyPI just to extract the metadata file 🙂

Don't worry @ewdurbin - I promise I won't actually run it for all of those files, I'll pass on the hundreds of 800MB PyTorch wheels, for a start!

(@pfmoore this is a bit off topic, but I wrote some code a few months ago that leverages HTTP Range requests and the zip format to minimize the number of bytes you need to extract metadata from wheels retrieved through HTTP :) https://github.com/python-poetry/poetry/pull/1803/files#diff-5a449c4763ca5e9acfa4dbb6bc875866981cb3d8ca12121ae78035faa16999b2R525 . If it helps you in any way, I'm glad!)

And pip has similar code too, behind fast-deps. :)
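For readers unfamiliar with the range-request trick, the core idea is that a zip archive keeps its table of contents (the central directory) at the end of the file, so a client can fetch only the tail, learn where the central directory lives, and then request just the bytes it needs. The sketch below shows only the first step, on an in-memory archive instead of over HTTP. It is not the Poetry or pip implementation; `locate_central_directory` is a hypothetical name, and ZIP64 archives are not handled.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FORMAT = "<4s4H2LH"  # Fixed-size part of the end-of-central-directory record.
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes


def locate_central_directory(tail: bytes) -> tuple[int, int, int]:
    """Return (entry_count, cd_size, cd_offset) parsed from the end of a zip.

    ``tail`` stands in for the last chunk of the archive, such as the body of
    an HTTP response to a ``Range: bytes=-N`` request.
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos == -1 or len(tail) - pos < EOCD_SIZE:
        raise ValueError("end-of-central-directory record not found in tail")
    fields = struct.unpack(EOCD_FORMAT, tail[pos:pos + EOCD_SIZE])
    _sig, _disk, _cd_disk, _disk_entries, entry_count, cd_size, cd_offset, _comment_len = fields
    return entry_count, cd_size, cd_offset


# Build a small wheel-like archive in memory to stand in for a remote file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo-1.0.dist-info/METADATA", "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n")
    zf.writestr("demo/__init__.py", "")
data = buffer.getvalue()

entry_count, cd_size, cd_offset = locate_central_directory(data[-1024:])
print(entry_count, cd_size, cd_offset)
```

A real client would follow up with a second range request for `cd_offset` through `cd_offset + cd_size`, and a third for the METADATA entry itself, which is exactly the dependence on server range support that the original post flags as a drawback.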

Also, I'll note that I was thinking of adding a data-has-static-metadata to the link tag, to denote that it has static metadata available.

I’ll spend some time writing a draft in the coming days unless there’s already someone doing it (doesn’t seem the case from recent comments?)

adding a data-has-static-metadata to the link tag

What would be the advantage of this? If we follow Paul’s naming proposal above, whether a link has static metadata can be canonically detected by searching for an <a> tag containing f"{filename}.metadata".

I'm pretty strongly inclined to only expose what there's a clearly established need for (i.e. the metadata file). It's easier to add more later than to remove something once it's added.

Also, I'll note that I was thinking of adding a data-has-static-metadata to the link tag, to denote that it has static metadata available.

At this point, it might be worth adding the METADATA file as a link data attribute (whole or relevant parts, base-64 encoded or whatever).

adding the METADATA file as a link data attribute (whole or relevant parts, base-64 encoded or whatever).

That would create huge index pages and will probably make things slower in practice. The metadata file contains free form fields like Description and I know there are projects with super long content in there (because they include their super long README). And the index would then include METADATA from every single version ever published, times every platform wheel for each version.

Even just including the relevant parts could be wasteful if there are more than a few releases in a project. You'd be throwing away most of them most of the time since it's very rare the resolver would ever need all of them.

It's easier to add more later than to remove something once it's added.

This brings up a consideration about the entry’s name on the index though: f"{filename}.metadata" would make it more difficult to include more files if we ever decide to. I’m currently leaning toward specifying METADATA by its “full” name instead, e.g. distribution-1.0-py3-none-any.whl/distribution-1.0.dist-info/METADATA.

That would create huge index pages [...] the index would then include METADATA from every single version ever published, times every platform wheel for each version.

You'd be throwing away most of them most of the time since it's very rare the resolver would ever need all of them.

Agreed. There would be some waste for sure.


[Thinking out loud... I don't know what should be optimized: number of requests or size of payloads. I have no idea what those numbers are, where the critical point is. Newer protocols seem to make it relatively painless to make multiple requests.]

I’ve drafted a PEP: python/peps#1955. I think I’ll need a sponsor and a PEP delegate (can they be the same person?)

Update: This is now PEP 658 https://www.python.org/dev/peps/pep-0658/

And now the PEP was merged 🎉

Is the implementation up for grabs, or do you plan to tackle it @uranusjr?

Also, trying to expose as many design decisions as possible early to help whoever will implement:

  • is Warehouse expected to
    • store the metadata file on S3 / something ?
    • store the text contents in the database?
    • Extract the file on request from the wheel (and make sure we have a very long CDN cache)
  • What is the expected content type of a Metadata file? (text/plain ?)

The PEP is currently being discussed at https://discuss.python.org/t/8651.

Anything that isn't specific to Warehouse itself should probably be discussed there.

Ah, sorry, got confused 😅

(I’m limiting the response to Warehouse-specific; others should be posted in the Discourse thread for visibility.)

Is the implementation up for grabs, or do you plan to tackle it.

It’s up for grabs. I plan to take a look into it only if nobody does anything and I can’t take it anymore. My threshold on this kind of thing is pretty high, so I suggest someone else work on this if they really want this to happen sooner rather than later 🙂

is Warehouse expected to

  • store the metadata file on S3 / something ?
  • store the text contents in the database?
  • Extract the file on request from the wheel (and make sure we have a very long CDN cache)

I haven’t thought that far tbh. I’d probably store the file separately on S3 if I have to implement it right now (I believe that’s how the GPG key is stored?) but I’m not the right person to make the decision either way.

What is the expected content type of a Metadata file? (text/plain ?)

Good question! I almost wanted to write it in the PEP but decided otherwise since it doesn’t really matter (PEP 503 didn’t specify the content type for wheels either), and I don’t know the answer. I guess text/plain is good enough. Feel free to ask this in the Discourse thread if you feel the PEP should explicitly specify one.

Feel free to ask this in the Discourse thread if you feel the PEP should explicitly specify one.

(not sure this needs to be in the PEP, but this will need to be in the implementation somehow)

It would help a lot to understand how this API works if anybody could provide a Jupyter or Observable notebook with a complete example of accessing the API for fetching dependency information.

The client-side logic is basically

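import hashlib
from urllib.parse import urldefrag
from urllib.request import urlopen


class PEP658NotAvailable(Exception): ...
class InvalidDataDistInfoMetadataAttr(Exception): ...
class HashAlgorithmUnsupported(Exception): ...
class HashMismatch(Exception): ...


def _fetch_metadata(url):
    """Download the metadata file with HTTP and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def find_metadata(anchor):
    """Fetch distribution metadata with PEP 658.

    `anchor` is any object exposing the tag's attributes as a dict in `.attrs`.
    """
    try:
        metadata_validation = anchor.attrs["data-dist-info-metadata"]
    except KeyError:
        raise PEP658NotAvailable()
    if metadata_validation == "true":
        # Skip validation.
        hashname = hashvalue = None
    else:
        hashname, sep, hashvalue = metadata_validation.partition("=")
        if not sep:
            raise InvalidDataDistInfoMetadataAttr()
        if hashname not in hashlib.algorithms_available:
            raise HashAlgorithmUnsupported()
    # Drop any "#<hashname>=<hashvalue>" fragment before appending the suffix.
    file_url, _fragment = urldefrag(anchor.attrs["href"])
    metadata_url = f"{file_url}.metadata"
    metadata_content = _fetch_metadata(metadata_url)
    if hashname is None:
        return metadata_content
    if hashlib.new(hashname, metadata_content).hexdigest() != hashvalue:
        raise HashMismatch()
    return metadata_content


# Find the `<a>` tag representing the file you want on a PEP 503 index page.
# You need to figure out this part yourself; it is out of scope of this issue and PEP 658.
anchor = _find_distribution_i_want()

metadata_content = find_metadata(anchor)
# Use a compatible parser (e.g. email.parser) to parse the metadata.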
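As a follow-up to the last comment in that snippet, here is a small sketch of parsing the downloaded bytes with the standard library's `email.parser`. `parse_core_metadata` and the sample document are made up for illustration; a real resolver would additionally parse the `Requires-Dist` strings as requirement specifiers, which is out of scope here.

```python
from email.parser import BytesParser


def parse_core_metadata(metadata_content: bytes) -> dict:
    """Pull the fields a resolver typically needs out of a METADATA file."""
    message = BytesParser().parsebytes(metadata_content)
    return {
        "name": message["Name"],
        "version": message["Version"],
        "requires_python": message["Requires-Python"],
        # Requires-Dist may appear many times, so collect every occurrence.
        "requires_dist": message.get_all("Requires-Dist", []),
    }


sample = (
    b"Metadata-Version: 2.1\n"
    b"Name: demo\n"
    b"Version: 1.0\n"
    b"Requires-Python: >=3.8\n"
    b"Requires-Dist: requests (>=2.0)\n"
    b"Requires-Dist: idna ; extra == 'socks'\n"
    b"\n"
    b"Long description goes here.\n"
)
info = parse_core_metadata(sample)
print(info["name"], info["version"], info["requires_dist"])
```

Note that the long description sits in the message body, so the parser skips past it without any extra work on the resolver's side.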

BTW PEP 658 has been accepted, so anyone interested please feel free to proceed on an implementation.

Was there an attempt to reduce the length of the param? Looks like the proposed scheme is to add the anchor to each line in https://pypi.org/simple/ like this.

-    <a href="/simple/abcmeta/">abcmeta</a>
+    <a href="/simple/abcmeta/#data-dist-info-metadata=sha3-256=a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a">abcmeta</a>

Which will increase the size from the current 17M to about 60M.

Instead of layering more hacks on top of existing hacks, may I propose to create a /simple.csv endpoint which will host the extensible CSV with header.

name, metahashtype, metahashvalue
abcmeta, sha3-256, a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a

Then anybody could get the info without the custom html and anchor parser. That would be easier implementation both on the client and on server side.

I would also consider using a single hashing function to reduce possible duplication when implementing distributed PyPI with content addressing later.

Looks like the proposed scheme is to add the anchor to each line in https://pypi.org/simple/ like this.

According to the PEP it's added to the project page (e.g. https://pypi.org/simple/pydash) rather than the index page. It would look something like this (assuming I've interpreted the PEP correctly).

@domdfcoding is correct. The PEP describes the per-project index pages.
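For readers following along, this is a sketch of how a client might read the attribute from a per-project page using only the standard library. The page content, host name, and hash values are invented placeholders, and `AnchorCollector` is a hypothetical helper. The sketch uses the PEP 658 attribute spelling; PEP 714, mentioned later in this thread, changes the attribute name.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> tag on a project page."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


# Invented example of a per-project page; hashes are placeholders.
PAGE = """
<!DOCTYPE html>
<html><body>
<a href="https://files.example.invalid/pydash-5.0.0-py3-none-any.whl#sha256=aaaa"
   data-requires-python="&gt;=3.6"
   data-dist-info-metadata="sha256=bbbb">pydash-5.0.0-py3-none-any.whl</a>
<a href="https://files.example.invalid/pydash-5.0.0.tar.gz#sha256=cccc">pydash-5.0.0.tar.gz</a>
</body></html>
"""

collector = AnchorCollector()
collector.feed(PAGE)

# Only files that advertise the attribute have a separate metadata file.
with_metadata = [a["href"] for a in collector.anchors if "data-dist-info-metadata" in a]
print(with_metadata)
```

The attribute lives on the file links of a single project's page, so the top-level https://pypi.org/simple/ listing does not grow.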

@pradyunsg if it is not possible to clarify in the title that the PEP is only for project pages, then it would save me quite a bit of time if it were specified in the abstract.

Without the code from @uranusjr and @domdfcoding I am not sure that I could interpret the PEP correctly. Why are there no examples in the PEP itself?

Back to the implementation. I am interested in estimating the size of the dependency metadata, which, according to this chapter, may be too long a list to be included on the page.

Am I right that, for now, the .metadata file is not extracted or stored anywhere?

If that is the case, then:

  • find the place on the upload pipeline where the .metadata should be extracted
  • find the place where the .metadata can be stored
  • write the code to extract the .metadata and store it
  • add endpoint to serve .metadata (I guess there is no handler, and it will be just a link to CDN)

You would probably want to just mimic how files themselves are stored/served. I would also just store the metadata file alongside the artifact itself in the object store. Look at how PGP signatures are handled basically.

The question of extraction still needs to be solved. Under forklift is the legacy upload API. The simplest thing to do would be to extract it inline with the upload, but we'd need to see if that ends up being a massive drag on speed or memory in the uploader. If so we might have to do it asynchronously.

An illustration of why this issue is so important. Imagine the load on PyPI if every library does such dependency resolution. That probably skews download stats a lot.

image

I can start with creating a reference notebook for fetching dependencies for a project like https://pypi.org/project/freemocap/ from warehouse. I'd prefer an alternative REST API that could be added instead of the simple one, but for now I need to know the command for extracting the METADATA file. If the upload is a wheel, which means a .zip file, it is possible to stream-scan it and extract the file without loading it fully into memory.

Under forklift is the legacy upload API.

@dstufft does that mean the API is going to be deprecated, but there is no other API yet? I am looking at the code right now to see where the upload happens.

The PyPI simple API is a REST api FWIW, it just has a serialization format that isn't great for complex data.

The plan was to, at some point, come up with a better upload API, ideally one that could work asynchronously. However it remains to be seen if that ever actually happens.

So the https://warehouse.readthedocs.io/api-reference/legacy.html#upload-api is the only upload API right now. Is that correct?

EDIT: Found that forklift is the module that implements the upload API - https://warehouse.pypa.io/application.html#file-and-directory-structure

Interesting that during upload there is a form that is checking the metadata.

https://github.com/pypa/warehouse/blob/3943226cf1168f5cead40913d42603e9d1f25010/warehouse/forklift/legacy.py#L853

And that metadata is not coming from a wheel. And it even seems to include dependencies. Is it stored in database?

Lazy-reading requires random access and I don’t think you can easily use it with the streaming approach used in forklift. It’s much easier to just re-open temporary_filename in read mode afterwards and pass it to zipfile.ZipFile.

It appears that ZipFile doesn't support scanning for filenames in a .zip stream. It does a direct lookup of the central directory at the end of the file (https://github.com/python/cpython/blob/720aef48b558e68c07937f0cc8d62a60f23dcb3d/Lib/zipfile.py#L1256-L1257). I don't want to write my own scanner, so opening a ZipFile on the existing temporary_file is indeed a much easier way.

The code for extraction is almost complete. I just need to find a reliable way to calculate the name of .dist-info dir, where METADATA file is located. For vodka-3.1.0-py3-none-any.whl this would be vodka-3.1.0.dist-info/METADATA.

Generally, I search for the file in the wheel whose name ends with .dist-info/METADATA. In theory, it should be <normalised project name>-<version>.dist-info/METADATA, but a lot of projects don't correctly normalise the name in the wheel, so it's safer to not require that. I'd recommend failing if there's more than one file matching *.dist-info/METADATA, too.
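The lookup described in the previous comment can be sketched with the standard `zipfile` module as follows. This is an illustration, not the code from the pull request: `extract_metadata` and `build_demo_wheel` are hypothetical names, and the demo archive is a stand-in built in memory. The digest at the end shows the value that would be recorded next to the file.

```python
import hashlib
import io
import zipfile


def extract_metadata(wheel_file) -> bytes:
    """Return the bytes of the single *.dist-info/METADATA file in a wheel."""
    with zipfile.ZipFile(wheel_file) as zf:
        candidates = [
            name for name in zf.namelist() if name.endswith(".dist-info/METADATA")
        ]
        # Refuse to guess when the wheel is ambiguous or has no metadata at all.
        if len(candidates) != 1:
            raise ValueError(
                f"expected exactly one METADATA file, found {len(candidates)}"
            )
        return zf.read(candidates[0])


def build_demo_wheel() -> io.BytesIO:
    """Build a tiny in-memory, wheel-like zip for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("vodka/__init__.py", "")
        zf.writestr(
            "vodka-3.1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: vodka\nVersion: 3.1.0\n",
        )
    buffer.seek(0)
    return buffer


metadata_bytes = extract_metadata(build_demo_wheel())
digest = hashlib.sha256(metadata_bytes).hexdigest()
print(len(metadata_bytes), digest)
```

`zipfile.ZipFile` accepts a path or any seekable file object, so the same function would work on a temporary file that has been re-opened in read mode.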

#9972 is ready for review once somebody approves tests to be run.

@pfmoore I used {distribution}-{version}.dist-info/METADATA for lookup. It should be safe, because everything else is an invalid wheel according to the spec (https://www.python.org/dev/peps/pep-0427/#file-name-convention). However, I don't mind if somebody can show me how to check the contents of all the latest wheels from PyPI without downloading them on a 25 Mbps connection.

There are no patches to the simple API. That is better addressed in a separate request.

I think using wheel_info.group("namever") should be good enough for now. We can always improve that after that info is exposed to the API; it’d be way easier to check when that’s done.
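For context, a `namever` group of this kind can come from a regular expression over the wheel file name. The pattern below is a sketch written from the PEP 427 file name convention, not a copy of Warehouse's actual expression; `WHEEL_FILE_RE` and `expected_metadata_path` are hypothetical names.

```python
import re

# {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl
WHEEL_FILE_RE = re.compile(
    r"""^(?P<namever>(?P<name>[^-]+)-(?P<ver>[^-]+))
        (-(?P<build>\d[^-]*))?
        -(?P<pyver>[^-]+)-(?P<abi>[^-]+)-(?P<plat>[^-]+)\.whl$""",
    re.VERBOSE,
)


def expected_metadata_path(wheel_filename: str) -> str:
    """Return the METADATA path the spec implies for a wheel file name."""
    wheel_info = WHEEL_FILE_RE.match(wheel_filename)
    if wheel_info is None:
        raise ValueError(f"not a valid wheel file name: {wheel_filename!r}")
    return f"{wheel_info.group('namever')}.dist-info/METADATA"


print(expected_metadata_path("vodka-3.1.0-py3-none-any.whl"))
```

This is the strict variant: a wheel whose internal directory is named differently from its file name simply will not have a member at the returned path.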

Fixed tests for #9972. Can somebody give approval to GitHub Actions to run them?

The #9972 wheel needs another kick. Sorry for the noise, hopefully it will be over soon. :D

Please drop comments on the PR, instead of notifying everyone on this issue.

PEP 658 Implementation is now merged. Wheels uploaded after it has deployed will have associated .metadata files served alongside them per spec.

We don't currently have a backfill plan, but will consider it after seeing how pips that support 658 react to the new stuff :)

Yay! I think that pip has supported this since 22.3 (pypa/pip#11111) although obviously it won't have got much testing yet. I don't know if there's any data we can usefully collect - I guess PyPI download stats on the new metadata files will be a good indication, though.

Extremely rough sketch that proves out a (hopefully) lightweight way of backfilling:

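from pathlib import Path

from pip._internal.network.lazy_wheel import dist_from_wheel_url
from pip._internal.network.session import PipSession

from warehouse.packaging.models import File

session = PipSession()

# `db` is assumed to be an existing database session (e.g. inside a Warehouse shell).
wheels_without_metadata = (
    db.query(File)
    .filter(File.packagetype == "bdist_wheel")
    .filter(File.metadata_file_sha256_digest.is_(None))
    .yield_per(100)
)

for file in wheels_without_metadata:
    url = f"https://files.pythonhosted.org/packages/{file.path}"
    # pip's lazy wheel support reads the wheel via HTTP range requests,
    # so only part of the file is downloaded. Note these are private pip internals.
    dist = dist_from_wheel_url(file.release.project.name, url, session)
    metadata = dist._dist._files[Path("METADATA")]
    print(f"processed {file}")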

Just a note that PyPI now also supports PEP 714 (alongside PEP 658), which fixed a bug in the spec.

Should this issue be closed, or kept open to track backfilling?

My 2 cents: Given that #13705 tracked the changes that were needed for PEP 714, let's use this issue to track the backfilling.

PEP 658 Implementation is now merged. Wheels uploaded after it has deployed will have associated .metadata files served alongside them per spec.

We don't currently have a backfill plan, but will consider it after seeing how pips that support 658 react to the new stuff :)

What are the next steps for backfilling?

Now that Warehouse has been generating the .metadata files since ~May 2023, and pip has supported them since pip 23.2 (where the parsing fixes + PEP 714 changes landed) which was released 2023-07-15 - I presume we're happy to say that there's been sufficient testing to confirm everything is working as intended, and as such there is no harm in backfilling?

di commented

@edmorley Essentially a PyPI admin needs to run the script in #8254 (comment) (plus add functionality to actually write the data to PyPI's object store)

Anything someone like me can do to help make this happen?

di commented

@brettcannon You can review #15368 if you'd like! 🙂

di commented

OK, the backfill is in progress. We have about 4M wheels to backfill metadata for, which I currently expect to take about 30 days, working from the newest wheel to the oldest.

ofek commented

Out of curiosity, what is the bottleneck?

di commented

No bottleneck, that's just how we have it configured to run.

di commented

Anyone in this thread have thoughts about what should happen with 'invalid' wheels like this one? When pip tries to extract the metadata from this, we get:

UnsupportedWheel: scipy1101 has an invalid wheel, scipy1101 has an invalid wheel, .dist-info directory 'scipy-1.10.1.dist-info' does not start with 'scipy1101'

di commented

For now, we'll just skip them: #15374 and #15375
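The error above comes from comparing the project name against the name of the .dist-info directory after normalisation. The sketch below is loosely modelled on what the message implies and is not a copy of pip's code; `canonicalize_name` follows the PEP 503 normalisation rule, and `dist_info_matches` is a hypothetical helper.

```python
import re


def canonicalize_name(name: str) -> str:
    """Normalise a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def dist_info_matches(project_name: str, dist_info_dir: str) -> bool:
    """Check that a .dist-info directory plausibly belongs to the project."""
    return canonicalize_name(dist_info_dir).startswith(canonicalize_name(project_name))


# The case from the error message: the names do not line up, so the wheel is skipped.
print(dist_info_matches("scipy1101", "scipy-1.10.1.dist-info"))
# A conforming wheel passes the same check.
print(dist_info_matches("scipy", "scipy-1.10.1.dist-info"))
```

A prefix comparison like this is deliberately cheap and will accept some wrong directories (any name that merely starts with the project name), so it works as a sanity check, not as proof that the metadata belongs to the project.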

Anyone in this thread have thoughts about what should happen with 'invalid' wheels like this one? When pip tries to extract the metadata from this, we get:

UnsupportedWheel: scipy1101 has an invalid wheel, scipy1101 has an invalid wheel, .dist-info directory 'scipy-1.10.1.dist-info' does not start with 'scipy1101'

Do not allow them to be uploaded in the first place.

@di skipping seems reasonable. It’s unfortunate, because consumers will have to fall back to downloading to discover the wheel is bad, but the standard doesn’t have a way to mark bad wheels.

@abitrolly For the future, maybe, but we have to deal with what was uploaded in the past.

@ofek that’s not what the spec says, and I know for a fact that if you do that you’ll hit problems with wheels containing multiple .dist-info directories (and the one you want isn’t necessarily the first one…)

@pfmoore what is the process for dealing with malicious wheels? Why can it not be applied here?

They aren't technically malicious, just malformed under the current spec (I don't know if they were valid under older versions of the spec without checking). The code in the wheel is itself presumably perfectly correct and usable. At least it will be in the example @di quoted. And with 4M wheels to backfill, a manual process [1] isn't going to scale, even if only a tiny proportion are failing.

Footnotes

  1. Which the process for malicious wheels will definitely be.

@pfmoore if the wheel works and pip can install it and extract the metadata, why can the backfill job not do this? Empty or partial metadata with hacks is better than maintaining new logic with data structures, which will crawl into the API and then require workarounds in tools.

if the wheel works and pip can install it and extract the metadata, why can the backfill job not do this?

I didn't say pip can install it - I haven't checked that. All I'm saying is that removing existing wheels from PyPI isn't something we can just do in an automated fashion, and the scale of the problem makes it impractical to do so manually. Skipping existing wheels that can't be backfilled is a simple, practical solution. For new wheels, if the metadata can't be extracted I'd expect PyPI to reject the upload, so it's only a problem with the occasional older wheel. And the backfill data is only an optimisation in any case.

I think we've probably spent more time debating the issue than it warrants, TBH.

Yeah, the good old times when you could endlessly discuss optimizations are over. It is all about busyness now.

ofek commented

@ofek that’s not what the spec says, and I know for a fact that if you do that you’ll hit problems with wheels containing multiple .dist-info directories (and the one you want isn’t necessarily the first one…)

If you're talking about this living document then I think the spec does not require specific naming for the metadata directory. From experience, one cannot rely on '-'.join(wheel_name.split('-')[:2]) being an exact match for that internal directory. Originally I had that as the logic but backends do weird things and users reported being unable to upload some wheels which is why I changed to the logic that I sent you.

This is also what other tools like twine do, which makes sense because the wheels had to end up on PyPI somehow...

See https://packaging.python.org/en/latest/specifications/binary-distribution-format/#file-contents:

{distribution}-{version}.dist-info/ contains metadata.

Agreed, not all wheels you'll encounter on PyPI follow that spec, but I believe that's mostly legacy issues. I think (hope!) all current backends follow the spec. And if they don't, it's a bug in the backend.

My experience comes from writing code that read all wheels from PyPI and extracted metadata, so precisely this use case. For that, I did have the sort of logic you describe, but it gave me a number of rather serious silent failures - projects whose extracted metadata had a different project name (not just normalised differently, but completely wrong!) for example. I tightened up the logic, but I still couldn't be sure my heuristics were always giving correct data.

In this situation, where it is far better to get no metadata extracted than to get incorrect metadata, I stand by my statement that skipping anything that doesn't match the spec is the right way to go. Don't forget, pip (and other tools) will use the metadata to skip downloading the full file if possible, so bad metadata could render a perfectly valid file unusable.

Anyway, if we skip, we can do another pass later to try a bit harder to extract the correct metadata. If we make an attempt now and it's wrong, we have a big problem even finding which files need re-checking.

di commented

The backfill seems to be going smoothly so I bumped the rate up to 2000 every 5 minutes or roughly 0.5M a day. There are currently 3.9M outstanding so I expect this to complete in about a week now.

Of the >100K files we've already processed on PyPI, only 3 have been flagged as having invalid metadata:

warehouse=> select filename from release_files where metadata_file_unbackfillable is True;
                          filename
------------------------------------------------------------
 MNIST_dir-0.2.0-py3-none-any.whl
 MNIST_dir-0.2-py3-none-any.whl
 pyg_library-0.2.0.dev20230413+pt113cpu-cp310-cp310-any.whl
(3 rows)

If that rate continues I think the plan of just skipping these indefinitely seems fine.

Of the >100K files we've already processed on PyPI, only 3 have been flagged as having invalid metadata:

I thought in your earlier message you'd said one of the problem files was scipy1101 - did I misunderstand?

di commented

@pfmoore That file was from TestPyPI, it just happened to be the first one.

if the wheel works and pip can install it and extract the metadata, why can the backfill job not do this?

Mentioning this for posterity and not for inviting continued discussion: pip will refuse to install such a wheel, so the "if" here evaluates to False.

di commented

We ran up against ratelimiting issues with our object storage backend, so the task did not remain at a rate of 2000 / 5 minutes for very long. Currently at 700 / 5 minutes, I will periodically bump that as we try to find the (undocumented) ratelimit.

the rate up to 2000 every 5 minutes or roughly 0.5M a day. There are currently 3.9M outstanding so I expect this to complete in about a week now.

https://www.wolframalpha.com/input?i=3.9+million+%2F+%282000+%2F+hour%29+in+days

Doing the math on this, I'm getting >80 days rather than about a week? Moving the rate down to 700/hr makes it >230 days.

Ah, I was doing my math wrong and it's 2000 per 5 minutes, not 2000 per hour. :)
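To spell out the arithmetic behind the estimates, here is a small calculation using only figures quoted in this thread (3.9M files outstanding, batches of 2000 or 700, run every 5 minutes or every hour). `days_to_finish` is a hypothetical helper name.

```python
OUTSTANDING = 3_900_000  # Files still to backfill, as quoted above.


def days_to_finish(outstanding: int, batch_size: int, batch_minutes: int) -> float:
    """Days needed to process `outstanding` files at one batch per interval."""
    batches_per_day = 24 * 60 / batch_minutes
    return outstanding / (batch_size * batches_per_day)


scenarios = [
    ("2000 per 5 minutes", 2000, 5),
    ("2000 per hour", 2000, 60),
    ("700 per 5 minutes", 700, 5),
    ("700 per hour", 700, 60),
]
for label, batch_size, batch_minutes in scenarios:
    days = days_to_finish(OUTSTANDING, batch_size, batch_minutes)
    print(f"{label}: {days:.1f} days")
```

At 2000 files every 5 minutes this comes to a little under 7 days, which matches the "about a week" estimate. Reading the rate as per hour gives about 81 days instead, which is where the confusion came from.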

100th comment!

With about 1.6M packages remaining, current forecast shows the task should be complete by March 4th.