Plus notation for .whl filenames overlap between pytorch patch versions.

Question

Plus notation for .whl filenames overlap between pytorch patch versions.

alexandervaneck opened this issue 2 years ago · 11 comments

🐛 Describe the bug

As mentioned in #66 (comment) there seems to be a regression of sorts to the + notation for .whl generation.

Copied from other issue

I believe to have found an edge case which resurfaces the original comment of "currently, the filenames are identical (just a different URL path), which confuses pip into thinking there is no need for a re-install"

The torch version in the + metadata is specified up to the minor version and therefore filenames can be identical between torch versions.

In https://data.pyg.org/whl/torch-1.13.1%2Bcu116.html (for 1.13.1)
pyg_lib-0.1.0+pt113cu116-cp310-cp310-linux_x86_64.whl
In https://data.pyg.org/whl/torch-1.13.0%2Bcu116.html (for 1.13.0)
pyg_lib-0.1.0+pt113cu116-cp310-cp310-linux_x86_64.whl

This then resurfaces all the mentioned 'problems' with pip not reinstalling when it clearly should when the torch version (or cuda) version is updated. Could we fix that by setting the torch version to the patch version?

Thank you for looking into this!

Environment

N/A

Answer 1 · 2023-02-16T05:48:00.000Z

They might actually be the same wheel, looks like only 1.13.0 is in the build matrix:

https://github.com/pyg-team/pyg-lib/blob/0.1.0/.github/workflows/building.yml

Answer 2 · 2023-02-16T11:05:01.000Z

That's very interesting. I'm not even sure how 1.13.1 ended up being uploaded as there's no 1.13.1 torch-version specified. 🤔

Answer 3 · 2023-02-17T13:58:10.000Z

We just indeed symlink them since there was a confusion for people that tried, e.g., -f https://data.pyg.org/whl/torch-1.13.1+cpu.html. All wheels are compatible across the a PyTorch patch version.

Answer 4 · 2023-02-18T20:52:17.000Z

would a url scheme in line with the plus notation like https://data.pyg.org/whl/pt113cu116.html make sense? to simplify and avoid confusion

Answer 5 · 2023-02-19T09:37:36.000Z

Yeah, this is a good idea, although we likely still want to maintain old behavior such that something like

import os
import torch
os.environ['TORCH'] = torch.__version__

!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html

still works (so it's not a top-prio for me at the moment).

Answer 6 · 2023-02-19T11:00:16.000Z

Good morning ☀️ . Pardon me for jumping in here 🙇 I believe I can add some extra thoughts and a possible solution to this problem that should solve it once and for all.

Problem statement

I notice that the current setup works well for jupyter notebook users (as provided by your example above) where it is common to install dependencies in separate cells. This allows users to first install torch (for whatever cuda version) and then later use the torch.__version__ in --find-links.

However, when these libraries need to be included in other downstream libraries is becomes more difficult. I maintain a rather large (private) ML library which supports a range of torch versions. The requirements.txt can be thought of as;

torch>=1.12.1,<1.14
torch-scatter~=2.1

Currently it is not possible to install this requirements.txt without hardcoding --find-links to an exact pytorch version upfront.

$ export PIP_FIND_LINKS="https://download.pytorch.org/whl/cu116/torch_stable.html https://data.pyg.org/whl/torch-1.13.1+cu116.html"
$ pip install -r requirements.txt torch==1.13.1

There's two issues with this approach:

We have to commit to a hardcoded torch version, even though we support several.
If we would "forget" to add torch==1.13.1 in pip install there's a chance pip will install a version of torch that is incompatible with the wheels in https://data.pyg.org/whl/torch-1.13.1+cu116.html (pip could pick torch==1.12.1 and we'd install pyg for 1.13.1).

Inspiration

Coincidentally torchvision is in the same situation as pyg libraries. They solve this as follows;

All torchvision packages for cu116 are available in https://download.pytorch.org/whl/cu116/torch_stable.html .
Every torchvision package supplies a Requires-Dist with the torch version it requires.

Note; torchvision doesn't use a + notation because they only ever support 1 version of torch at a time. but there's no reason we couldn't do that.

Possible solution

I think the least confusing method would be to do what torch/torchvision do and rely on pip to figure this out instead of our users. Therefore I propose the following;

Add a Required-Dist with [torch-version].* (f.e. 1.13.*) to every wheel compiled. This will allow pip to understand which packages are compatible.
Add all wheels to https://data.pyg.org/whl/[cuda-version].html (f.e. https://data.pyg.org/whl/cu116.html) for every cuda version used during compilation. Using + notation to differentiate the package file names (torch-scatter+pt112cu116.whl, torch-scatter+pt113cu116.whl).

This then allows:

$ export PIP_FIND_LINKS="https://download.pytorch.org/whl/cu116/torch_stable.html https://data.pyg.org/whl/cu116.html"
$ pip install -r requirements.txt
or
$ pip install torch torch-scatter --find-links "https://download.pytorch.org/whl/cu116/torch_stable.html https://data.pyg.org/whl/cu116.html"

This has some advantages over the current approach:

We use pip's dependency resolver to figure out which dependency is compatible. (with Requires-Dist)
Users don't have to hardcode a torch version in their build process to tell pip what to do.
Users don't have to update their link unless they are changing cuda versions.
No chance of mistakingly installing a technically unsupported torch version with pyg library.
When a patch for torch comes out f.e. 1.13.2 pyg doesn't have to do anything. The package for 1.13.* will be allowed to be installed by pip.

In short

I see two relatively simple pieces of work to make this happen:

Resolve #200 to add Requires-Dist of f.e. torch==1.13.* when we compile any pyg library for that version of torch.
Upload all wheels to a .html per cuda version. f.e. https://data.pyg.org/whl/cu116.html (the + notation already makes every package unique)

Is there anything I haven't considered in this proposal? Your guidance on this topic would be much appreciated! 🙇

(p.s. I'm happy to work on this if either of you are busy at the moment!)

Answer 7 · 2023-02-19T11:33:59.000Z

I'm not sure what pip's behaviour will be when pip sees links to both pyg-42.42+pt112cu116.whl and pyg-42.42+pt113cu116.whl, and your reqs specify only pyg>=42.24. The hope is that pip will inspect both wheels with same version but different plus notation, to check for compatibility with e.g. pytorch==1.13.0. Not sure how to confirm that behaviour without actually implementing it 🤔

There is another potential issue, where I'm afraid you'll have to hardcode the plus-notation in your reqs.txt: if pyg==42.42+cpu is already installed in your env, and you do a pip install -f cu116.html pyg==42.42, there wont be a reinstall for cu116, pip will be happy as-is. If you add different pytorch versions to the mix, this gets even more iffy.

For ray, the docker image for gpu is based on the cpu image, so there you have this scenario and you have to hard code plus notation for the second install: ray-project/ray#26072

Answer 8 · 2023-02-19T12:37:47.000Z

I'm not sure what pip's behaviour will be when pip sees links to both pyg-42.42+pt112cu116.whl and pyg-42.42+pt113cu116.whl, and your reqs specify only pyg>=42.24. The hope is that pip will inspect both wheels with same version but different plus notation, to check for compatibility with e.g. pytorch==1.13.0. Not sure how to confirm that behaviour without actually implementing it 🤔

I'll do some testing this week to see. 🤔 I'm also interested. Probably I can compile pyg twice with Requires-Dist and then check by directing pip to the dist dir.

There is another potential issue, where I'm afraid you'll have to hardcode the plus-notation in your reqs.txt: if pyg==42.42+cpu is already installed in your env, and you do a pip install -f cu116.html pyg==42.42, there wont be a reinstall for cu116, pip will be happy as-is. If you add different pytorch versions to the mix, this gets even more iffy.

Thank you for pointing this out. My hope is that we could do Requires-Dist: torch==1.13.*+cu116 and that pip will understand that. But I will do some testing 🤔

Answer 9 · 2023-02-19T13:24:10.000Z

There is another potential issue, where I'm afraid you'll have to hardcode the plus-notation in your reqs.txt: if pyg==42.42+cpu is already installed in your env, and you do a pip install -f cu116.html pyg==42.42, there wont be a reinstall for cu116, pip will be happy as-is. If you add different pytorch versions to the mix, this gets even more iffy.

My hope is that we could do Requires-Dist: torch==1.13.*+cu116 and that pip will understand that.

To solve this we should do Requires-Dist: torch<1.14+cu116,>=1.13.0+cu116 in pyg. At least with this addition pip will have a chance to figure it out. It will depend on how pip was set up.

$ pip install 'torch<1.14,>=1.13.0+cu116' --find-links https://download.pytorch.org/whl/cpu/torch_stable.html
# installs cpu, but should've errored out.
$ pip install 'torch<1.14,>=1.13.0+cu116' --extra-index-url https://download.pytorch.org/whl/
# install cu116

To follow your example

$ pip install pyg==42.42 --extra-index-url https://download.pytorch.org/whl/ --find-links cu116.html
# should install 'torch<1.14,>=1.13.0+cu116'  & pyg==42.42+cu116

I've opened an issue for https://download.pytorch.org/whl/cpu/torch_stable.html to also include local version identifiers (see pytorch/pytorch#95135) so that both of them will start working as expected.

@ddelange do you agree that this is a safe change? I really would like to see this as two separate issues. Adding Requires-Dist seems straight-forward and doesn't make things worse, only adding information for pip.

How we could put those wheels together on 1 .html page and how pip would behave is secondary IMO.

Answer 10 · 2023-02-19T14:00:17.000Z

I'm sorry for the following; I'm doing some testing right now and coming up with very strange results on how pip looks at local version identifiers. Disregard the previous comment. :)

I'll come up with some clear proof that this would or would not work. 🙌 Thank you for the questions!

Answer 11 · 2023-02-19T14:56:09.000Z

afaik, if you want to specify a local version label in your package's Requires-Dist (or in a requirements.txt passed directly to pip), you can only do so in a pinned/exact manner.

You can not use wildcards in neither the version, nor in the local version label:

$ pip wheel --no-deps 'torch==1.13.*+cu116' --extra-index-url 'https://download.pytorch.org/whl/cu116'
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cu116
ERROR: Could not find a version that satisfies the requirement torch==1.13.*+cu116 (from versions: 1.4.0, 1.5.0, 1.5.1, 1.6.0, 1.7.0, 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.0+cu116, 1.12.1, 1.12.1+cu116, 1.13.0, 1.13.0+cu116, 1.13.1, 1.13.1+cu116)
ERROR: No matching distribution found for torch==1.13.*+cu116

And you can not specify version ranges in combination with a local version label. It looks like the specified local version label will simply be ignored (here it downloads a cu116 wheel):

$ pip wheel --no-deps 'torch>=1.12,<=1.13.2+cpu' --extra-index-url 'https://download.pytorch.org/whl/cpu' --extra-index-url 'https://download.pytorch.org/whl/cu116'
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://download.pytorch.org/whl/cpu, https://download.pytorch.org/whl/cu116
Collecting torch<=1.13.2+cpu,>=1.12
  Downloading https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl (1977.9 MB)

So your hands are kind of tied when it comes to local version labels on your side I think.

The hope is that pip will inspect both wheels with same version but different plus notation

If this is indeed the case, a PR @ pyg could make sense to add pytorch ranges (without local version labels) to the wheels, such that your package doesn't need to specify any local version labels, and your users can control cuda/cpu exclusively over the URLs passed to pip.

The problem I mentioned with existing installations still persists though, not sure whether that is a pip bug or feature 😅