pypa/pip

pip parses `dist-info-metadata` key from JSON indexes incorrectly

chriskuehl opened this issue · 36 comments

Description

When using a JSON-based package index, pip expects the dist-info-metadata key to always have a string value and does not handle the boolean or dict values described by the spec.

PEP 691 states:

dist-info-metadata: An optional key that indicates that metadata for this file is available, via the same location as specified in PEP 658 ({file_url}.metadata). Where this is present, it MUST be either a boolean to indicate if the file has an associated metadata file, or a dictionary mapping hash names to a hex encoded digest of the metadata’s hash.

pip tries to apply the same string parsing to this value as it does to the data-dist-info-metadata attribute value from PEP 658, and does not handle booleans or dicts.

This does not work, because according to PEP 691 the entries look like:

  {
    "dist-info-metadata": {
      "sha256": "8737b5dbb73f6b05e596c7c659f694a031de98b926e634716d9855fd365a5f2f"
    },
    "filename": "pre_commit-3.3.2-py2.py3-none-any.whl",
    [...]
  },
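For context, a spec-compliant client has to accept three shapes for this value: the PEP 658 HTML attribute string (e.g. "sha256=<hex digest>" or "true"), a PEP 691 boolean, or a PEP 691 dict of hashes. A minimal sketch of that normalization (illustrative only, not pip's actual code):

  def parse_metadata_value(value):
      """Normalize a PEP 658/691 metadata value to a dict of hashes.

      Returns None if no metadata file is advertised, or a (possibly
      empty) dict mapping hash names to hex digests if one is.
      Illustrative sketch, not pip's actual implementation.
      """
      if value is None or value is False:
          return None  # no metadata file for this distribution
      if value is True:
          return {}  # metadata file exists, but no hash was provided
      if isinstance(value, dict):
          return value  # JSON form: {"sha256": "<hex digest>", ...}
      # HTML attribute form: "true" or "<hashname>=<hex digest>"
      if value == "true":
          return {}
      name, sep, digest = value.partition("=")
      return {name: digest} if sep else {}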

Expected behavior

pip should handle values with boolean or dict type as described in PEP 691.

pip version

Tested with 23.1.2 and current main

Python version

3.11

OS

Linux (debian bullseye)

How to Reproduce

To reproduce, we need access to an index that supports PEP 691 JSON responses and also PEP 658 metadata.

pypi.org is supposed to support both of these for new file uploads, but there is currently a bug where it provides the metadata under the wrong JSON key (pypi/warehouse#13705), so it is hard to test at the moment. For now, the easiest way to reproduce is to temporarily patch your pip code to use the (incorrect) key to trigger the bug.

  1. Temporarily patch your pip code so that it uses the key name provided by pypi.org:

    diff --git a/src/pip/_internal/models/link.py b/src/pip/_internal/models/link.py
    index e741c3283..5d1e0be4e 100644
    --- a/src/pip/_internal/models/link.py
    +++ b/src/pip/_internal/models/link.py
    @@ -262,7 +262,7 @@ class Link(KeyBasedCompareMixin):
             url = _ensure_quoted_url(urllib.parse.urljoin(page_url, file_url))
             pyrequire = file_data.get("requires-python")
             yanked_reason = file_data.get("yanked")
    -        dist_info_metadata = file_data.get("dist-info-metadata")
    +        dist_info_metadata = file_data.get("data-dist-info-metadata")
             hashes = file_data.get("hashes", {})
     
             # The Link.yanked_reason expects an empty string instead of a boolean.
  2. Try to install a package.

    • pip install --no-cache --only-binary :all: pre-commit==3.3.2 to reproduce pip trying to parse a metadata dict as a string
    • pip install --no-cache --only-binary :all: pre-commit==3.0.0 to reproduce pip trying to parse a boolean false as a string

Output

$ pip install --no-cache --only-binary :all: pre-commit==3.3.2          
Collecting pre-commit==3.3.2
ERROR: Exception:
Traceback (most recent call last):
  File "/home/ckuehl/proj/pip/src/pip/_internal/cli/base_command.py", line 169, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/cli/req_command.py", line 248, in wrapper
    return func(self, options, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/commands/install.py", line 377, in run
    requirement_set = resolver.resolve(
                      ^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/resolver.py", line 92, in resolve
    result = self._result = resolver.resolve(
                            ^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_vendor/resolvelib/resolvers.py", line 546, in resolve
    state = resolution.resolve(requirements, max_rounds=max_rounds)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_vendor/resolvelib/resolvers.py", line 397, in resolve
    self._add_to_criteria(self.state.criteria, r, parent=None)
  File "/home/ckuehl/proj/pip/src/pip/_vendor/resolvelib/resolvers.py", line 173, in _add_to_criteria
    if not criterion.candidates:
  File "/home/ckuehl/proj/pip/src/pip/_vendor/resolvelib/structs.py", line 156, in __bool__
    return bool(self._sequence)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/found_candidates.py", line 155, in __bool__
    return any(self)
           ^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/found_candidates.py", line 143, in <genexpr>
    return (c for c in iterator if id(c) not in self._incompatible_ids)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/found_candidates.py", line 47, in _iter_built
    candidate = func()
                ^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/factory.py", line 206, in _make_candidate_from_link
    self._link_candidate_cache[link] = LinkCandidate(
                                       ^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/candidates.py", line 293, in __init__
    super().__init__(
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/candidates.py", line 156, in __init__
    self.dist = self._prepare()
                ^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/candidates.py", line 225, in _prepare
    dist = self._prepare_distribution()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/resolution/resolvelib/candidates.py", line 304, in _prepare_distribution
    return preparer.prepare_linked_requirement(self._ireq, parallel_builds=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/operations/prepare.py", line 510, in prepare_linked_requirement
    metadata_dist = self._fetch_metadata_only(req)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/operations/prepare.py", line 374, in _fetch_metadata_only
    return self._fetch_metadata_using_link_data_attr(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/operations/prepare.py", line 384, in _fetch_metadata_using_link_data_attr
    metadata_link = req.link.metadata_link()
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/models/link.py", line 417, in metadata_link
    metadata_link_hash = LinkHash.parse_pep658_hash(self.dist_info_metadata)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ckuehl/proj/pip/src/pip/_internal/models/link.py", line 77, in parse_pep658_hash
    name, sep, value = dist_info_metadata.partition("=")
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'partition'

The traceback for pre-commit==3.0.0 is the same, except the last line is instead:

AttributeError: 'bool' object has no attribute 'partition'

This needs an urgent fix, as pip will start failing on any package with a PEP 658 metadata file when PyPI release their fix. @pypa/pip-committers any thoughts? Should we release an emergency bug fix mid-cycle? How do we communicate the issue (as every version of pip back to 22.something will fail)? Is there anything Warehouse could do to mitigate this?

I’m happy to release an emergency fix, but I’m nervous about how we test it sufficiently in advance of Warehouse releasing their fix. And I don’t know how we handle the rollout issue.

On reflection, I was forgetting that PyPI haven't backfilled yet. Assuming they hold off on the backfill until this is fixed, the problem isn't as severe as I'd feared - it will only impact new wheel uploads. As a result, I think that an emergency release with advice that if people hit the issue, they have to upgrade pip should be sufficient. We'll still need to monitor the situation, and PyPI will need to consider how to co-ordinate the backfill with uptake of new pip versions, but we should be OK.

Assuming they hold off on the backfill until this is fixed, the problem isn't as severe as I'd feared - it will only impact new wheel uploads.

At the moment warehouse is setting "data-dist-info-metadata": false for files which don't have the metadata, so effectively all packages are broken right now with current pip versions. But perhaps warehouse could be changed to omit the key entirely when there is no metadata file, instead of setting it to false?
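For illustration, the suggested server-side change amounts to something like the following when building a file entry (a hypothetical sketch, not warehouse's actual code; build_file_entry and metadata_hashes are assumed names):

  def build_file_entry(filename, url, hashes, metadata_hashes=None):
      # Hypothetical sketch: omit the metadata key entirely when no
      # metadata file exists, instead of emitting false.
      entry = {"filename": filename, "url": url, "hashes": hashes}
      if metadata_hashes is not None:
          # Per PEP 691 the value is a dict of hashes, or true if the
          # metadata file exists but no hash is available.
          entry["dist-info-metadata"] = metadata_hashes or True
      return entry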

Do we know if our current PEP 658 implementation was tested with any real-world (private) index? Could we use the same to test these fixes?

About the rollout I'm not sure either. It looks like the pip fix has to be released before the pypi.org fix, since all installs would otherwise fail without a workaround?

In that case it might also be an exceptional situation where we need to release a bug fix for a non-current version (the bug has actually been in since 22.3), since some users may not be in a position to upgrade to 23.1 just yet, due to the recent removal of setup.py install support.

I'm in agreement that backporting and releasing bugfixes for older versions is a good idea (all the way back to when this was released), and something that we should do. I also agree that we should cut emergency-style bugfix releases for this (as is implied).

I'll take a look into the fix for this today (I'm at the airport for an 8+ hr flight, and this seems like a good thing to fill my hours + keep me awake), and review our PEP 658 implementation overall.

I'm happy to help with the release management work for this as well, although I'm unlikely to have a lot of "definitely free" time next week.

At the moment warehouse is setting "data-dist-info-metadata": false for files which don't have the metadata

Ouch, that's bad. But assuming PyPI fix this (the key is optional, so they can fix it) the damage is significantly limited. Enough to not need backports? I don't know. I was only thinking in terms of a 23.1.3 bugfix, which I'm happy to do, but @sbidoul has a point. I don't want to commit myself to doing backports, though, as I'm not sure I'll have the free time for that.

FWIW, the key has a typo right now. Once that typo is fixed, the pip bug will rear its head, so we should fix pip before the typo is corrected; pypi.org then wouldn't need to do anything other than fix that typo.

That seems reasonable, my only question is, to what extent do we care about users who haven't upgraded their copy of pip - is "there's a fix, you need to upgrade" enough?

Is there a workaround for users who don't upgrade Pip? Like a command line flag to avoid using the PEP 658 metadata file?

Just asking because there will be some noise from users if PyPI stops working even when an upgrade to a bugfix is available; many users have strange setups where things that should be easy aren't always easy.

Is there a workaround for users who don't upgrade Pip?

Basically no. That's the big problem here: if an index serves JSON, we'll use it in preference to HTML, and if it provides a dist-info-metadata value we'll mis-parse it and fail. There's no flag that controls either of those two behaviours. And adding a flag is (of course) pointless, because the user could just as easily upgrade to a fixed version as to one with an opt-out.
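For background, the JSON-over-HTML preference comes from PEP 691 content negotiation: the client advertises the content types it accepts, and an index that supports JSON will serve it. Roughly (an illustrative sketch using requests; pip's real Accept header and q-values may differ):

  import requests

  resp = requests.get(
      "https://pypi.org/simple/pre-commit/",
      headers={
          # PEP 691 content types; JSON is preferred over HTML.
          "Accept": (
              "application/vnd.pypi.simple.v1+json, "
              "application/vnd.pypi.simple.v1+html;q=0.1"
          )
      },
  )
  print(resp.headers["Content-Type"])  # application/vnd.pypi.simple.v1+json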

The best someone could do is write a service that proxies PyPI but without serving the JSON. But I'm not aware of anyone proposing to do that.

Just to be clear too, the above is true because of the specifics of this bug, but if pip preferred the HTML version, the same general problem could happen in reverse. There's nothing specific to JSON here other than the bug happens to be in the JSON API on PyPI.

This sort of bug is honestly a risk anytime there's some code path that isn't being used, that suddenly starts being used.

I think there just isn't a path forward for people who don't upgrade pip; even without the backfill, sooner or later most projects are going to end up with releases that have dist-info-metadata.

Hmm, in that case I recommend that anyone with channels to "get the word out" use them when pip publishes these bug fixes; I will post it on the Reddit community.

Do we know if our current PEP 658 implementation was tested with any real-world (private) index ? Could we use the same to test these fixes?

proxpi v1.1rc0 [1] causes the mentioned error when proxying an index which correctly provides the metadata. This can be achieved with PyPI by commenting out this line (i.e. disabling JSON-response requests to PyPI).

Footnotes

  1. shameless plug

Since taking on mousebender as a dependency for pip has been brought up previously: https://github.com/brettcannon/mousebender/blob/main/mousebender/simple.py had this covered for both response types, with tests for the HTML side of things (the JSON side is a TypedDict, so it was implicitly covered).

Do we have a gameplan here? What can we do to unblock this?

There was mention of a patch release of pip, is that still being considered?

Unfortunately, the affected pip 23.0.1 has now been backported to today's Python 3.7, 3.8 and 3.9 releases:
python/cpython#101997
python/cpython#102273
python/cpython#102244
python/cpython#102243

Prior to this backport, those release branches had been shipping pip 22.0.4, which wasn't affected because it didn't include #11111 (the feature didn't land until pip 22.3).

This presumably means the warehouse fix (pypi/warehouse#13706; which will allow pip and other clients to start benefiting from the feature) is now blocked not only on the pip bug (the bug described in the OP here) being fixed in a new pip bugfix release, but also on new CPython releases containing that updated pip?

This might particularly be an issue for Python 3.7, given it reaches EOL at the end of this month, so I'm guessing there may not even be another 3.7 release after today's 3.7.17 release?

Short of emergency pip/CPython releases, is there any other way the warehouse fix can be unblocked? Does warehouse have access to the pip version via the user agent, and so could use it to only serve the new dist-info-metadata key for fixed pip versions perhaps?

OOF. size large.

The proposal (bifurcation on user agent) is not likely to happen as it will double the number of objects we need to cache in our CDN.

So we're kind of in a crummy spot right now.

We have a widely deployed version of pip that contains a bug, that went unnoticed because of a PyPI bug, but we're blocked on fixing that PyPI bug because of the pip bug.

The way I see it we have 3 real options here:

  1. Fix the bug in pip, wait some length of time until the broken versions of pip are "old enough", then fix the PyPI bug and break those versions of pip without a good way to upgrade.
  2. Fix the bug in pip, fix the bug in PyPI, but special case pip itself so it doesn't have the PEP 658 metadata key at all until the broken versions of pip are "old" enough. This breaks those versions of pip, but allows them to easily upgrade.
  3. Amend PEP 691 (or make a new mini PEP if we think it's needed) to rename the key that PEP 658 metadata uses to something completely different. This would allow both pip and PyPI to fix their bugs independently, but means we have to amend the spec.
    • We might want to also amend the HTML representation to match? or deal with them having different names?

I think that (3) is probably our best option here: we could just rename the key from dist-info-metadata to core-metadata. Old versions of pip won't take advantage of the key by default, but they were never going to anyway, since pip had a bug that prevented it. It avoids a big breaking change, and it avoids having to wait some long period of time before we can take advantage of PEP 658.
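On the client side, option (3) would look something like this sketch (illustrative only; the key names shown are the proposed rename, and real clients would follow whatever the amended spec settles on):

  def get_metadata_value(file_data):
      # Prefer the proposed renamed key; fall back to the old one for
      # indexes that still serve it. Hypothetical sketch.
      for key in ("core-metadata", "dist-info-metadata"):
          if key in file_data:
              return file_data[key]
      return None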

I agree that (3) seems like the best option. Even if fixed versions of pip were released via new bugfix releases of CPython 3.7/3.8/3.9/3.10/3.11, there will still be some users who are using non-latest patch releases of CPython who would end up being broken by options 1/2.

hugovk commented

This might particularly be an issue for Python 3.7, given it reaches EOL at the end of this month, so I'm guessing there may not even be another 3.7 release after today's 3.7.17 release?

It's possible, but unknown.

A potentially mitigating factor: 3.7-3.10 are source-only releases. Only 3.11 (and 3.12) get binary installers.
https://devguide.python.org/versions/

Unfortunately, the affected pip 23.0.1 has now been backported to today's Python 3.7, 3.8 and 3.9 releases:

There are also 3.10 and 3.11 releases due out today (3.12.0b2 already went out). I don't think the builds have started, so there might still be time to adjust the pip version (e.g. roll it back) if you're quick?

A potentially mitigating factor: 3.7-3.10 are source-only releases

True, but I suspect most places are still distributing all supported CPython versions, even the source-only ones. For example:
https://hub.docker.com/_/python/tags?page=1&name=3.9

There are also 3.10 and 3.11 releases due out today. I don't think the builds have started and there might be still be time to adjust the pip version (e.g. rollback) if you're quick?

Python 3.10 and 3.11 have been using the affected pip versions for some time (the last release of each was already on pip 22.3.1, and the feature has been in pip since 22.3) - so the situation for those versions wasn't made any worse by today's backport.

Keep in mind that the bundled pips in these releases are only part of the optional ensurepip module, which provides a minimum bootstrap version of pip that can then be used to upgrade to the most current version. Most third-party distributors of Python do not include ensurepip in their packages, as they provide their own versions of pip. I don't think we need to do anything urgent about this.

Well, many people do use the bundled version of pip; I think all of the Python Docker container versions do, Homebrew, etc.

That being said, I think this bug has existed dormant in pip since at least 22.3, so we're already well past the point where we can avoid having it "out in the wild", so I don't think today's CPython releases matter much at all for our strategy here, other than it would have been nice to get a fixed version into those releases.

I think since (3) is probably our best option, I'm going to raise a topic on discuss.p.o.

Agreed, I think (3) is the best option. I'm inclined to think we should produce a PEP, as there's the possibility that other index providers have started to implement the existing spec - although it's unlikely any have deployed it, as if they had, we'd have got bug reports for pip before now...

pypi/warehouse#13706 is now an implementation of PEP 714 awaiting pronouncement.

I'm looking at a pip fix.

Hmm, there's an important difference between PEP 658 (which specifies how hashes are supplied in the HTML index) and PEP 691 (which specifies the JSON form). The JSON form allows for multiple hashes, which isn't possible in the HTML form. So correctly processing the JSON data will need more than just fixing how the data is parsed; it will need extra functionality to handle multiple hashes. It's going to be a deeper fix than it looked at first.

You're not required to use all of them. PEP 691:

Multiple hashes can be included, and it is up to the client to decide what to do with multiple hashes (it may validate all of them or a subset of them, or nothing at all).

So you can just pick one of them to verify if that makes it easier.
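For instance, a client could verify the metadata file against whichever advertised hash it prefers (a minimal sketch, assuming the PEP 691 dict form; the function and constant names are made up for illustration):

  import hashlib

  PREFERRED_HASHES = ("sha512", "sha384", "sha256")

  def verify_metadata(content: bytes, hashes: dict) -> bool:
      # Pick the strongest hash we recognize among those the index
      # advertised; PEP 691 permits validating any subset of them.
      for name in PREFERRED_HASHES:
          if name in hashes:
              return hashlib.new(name, content).hexdigest() == hashes[name]
      return True  # no hash we recognize; nothing to check against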

The infrastructure to handle multiple hashes is all there (the Link.hashes attribute); it's just that the data structures used to parse dist-info-metadata don't handle multiples. Basically Link.metadata_link is fine; it's the intermediate LinkHash structure that's the limited one. I may just replace that class altogether. (I could just do something specifically for the JSON case, but that would be messy, and I want to keep the implementation clean, as I really don't want to introduce any new problems here.)

I'm also thinking we need some much better tests of this, so we can be sure we've actually fixed the issue before saying it's OK for PyPI to turn on their implementation.

Sorry, I shouldn't be trying to pair-program via GitHub issues 😉

The good news is we should be able to turn on PyPI's implementation of PEP 714 before pip lands their implementation, since the whole purpose of the strategy in the PEP is that it enables PyPI to land it without breaking the existing logic in pip. Not that much better tests would be a bad thing of course :) but I know how gnarly pip's code base can be to test too, so there's not a huge rush to land it in pip prior to PyPI anymore.

Hmm, that's a good point and probably worth doing, when we get to that point. It makes me a lot more comfortable that I can check that I haven't made a dumb mistake before we unleash the pip fix on the world 🙂

PyPI's implementation of PEP 714 just landed thanks to @ewdurbin :)

Pip's will land as soon as I can get past the *&^%$£$% RAM disk error on Windows CI.