pypa/packaging-problems

Campaign to get people Publishing Wheels

dstufft opened this issue · 133 comments

How can we get more people to publish wheels, especially for Windows? Christoph Gohlke publishes Windows installers, but that approach won't carry over to wheels because he doesn't have the rights to upload them.

Perhaps the build farm I've been wanting to set up could be used here?

alex commented

http://pythonwheels.com/ is an attempt at this; now that pip 1.5 installs wheels by default, this should be easier.

I think one part of this would be to make the setup.py ... package-uploading process more streamlined, so that it does the right thing.

kura commented

I was thinking: surely it makes sense to have a simple "build" command that makes wheels, eggs, and an sdist by default, rather than having to specify each one separately?

Am I wrong in thinking that you still need to install another package just to create wheels?

alex commented

Yes, you need to pip install wheel before setup.py bdist_wheel works. Also, you really shouldn't be making eggs ;)
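
For reference, a minimal sketch of that workflow (the project name and layout are illustrative): with setuptools and the wheel package installed, a bare-bones setup.py is enough for python setup.py sdist bdist_wheel to produce both an sdist and a wheel under dist/.

# Minimal sketch of a setuptools-based setup.py; "example-pkg" is illustrative.
# With the `wheel` package installed, `python setup.py sdist bdist_wheel`
# produces both an sdist and a wheel under dist/.
from setuptools import setup, find_packages

setup(
    name="example-pkg",
    version="0.1.0",
    packages=find_packages(),
)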

As of 2015, Christoph Gohlke publishes wheels rather than MSI installers: http://www.lfd.uci.edu/~gohlke/pythonlibs/

@scopatz is this something you could comment on?

Thanks for roping me into this issue @brainwane.

I am speaking on behalf of conda-forge here. But basically, we'd love it if conda-forge could be used to build & publish wheels. To that end, it might be more useful to think of conda-forge as just "The Forge."

We have the infrastructure for building binary packages across Linux, OS X, Windows, ARM, and Power8 already. We have a tool called conda-smithy that we develop and maintain that helps us keep all of the packages / recipes / CIs configured and up-to-date.

I see two major hurdles to building and deploying wheels from conda-forge. These could be worked on in parallel.

Building: conda-smithy would need to be updated so that packages that are configured to do so would generate the appropriate CI scripts (from Jinja templates) to build wheels. This would be CI-provider and architecture specific. Probably the easiest place to start is building from manylinux on Azure. We would probably need at least one configuration variable to live in conda-forge.yml that actively enables wheel building (enable_wheels: true? enable_wheels: {linux-64: true}?). Conda-smithy reads this file when it rerenders a feedstock (a git repo with a specific structure for building packages). There are probably some subtleties and difficulties here in working through which compiler toolchains should be used on different platforms (only Linux really has a standard, namely manylinux). But this is the basic idea.

The challenge with building is that most of the conda-forge people are not used to building wheels. I am happy to help work on the conda-forge infrastructure side, but I think we need someone who is an expert on the wheels side who is also willing to jump in and help scale this out with me.

Deploying: Once we can build wheels, we need a place to put them. Nominally, this would be PyPI. But we need to be able to do this from a CI service. We would be happy to use an authentication token for that. There isn't much that I can see conda-forge can really do about this (which has prevented us from working on this issue previously). However, I think PyPI is working on this.

I am super excited about this; the fundamental premise of conda-forge is to be open-source, cross-platform, community build infrastructure. If there are other folks out there who are enthusiastic about getting this working, please reach out to me or put me in touch!

Thanks @scopatz! @waveform80 and @bennuttall would you like to speak from the piwheels perspective? And @jwodder, from what you have learned via Wheelodex? (Found out about you via this thread.)

Perhaps the work that @matthew-brett did at MacPython to build wheels of key packages of the Scientific Python stack will be helpful as well. Also, I discovered cibuildwheel by @joerick recently. (Edit: wrong Matthew Brett)

For the piwheels project, we build Arm platform wheels for the Raspberry Pi natively on Raspberry Pi hardware and host them on piwheels.org. We don't try to bundle dependencies à la manylinux2010; instead we target what's stable in the distro (Raspbian) and make no promises elsewhere. The project source itself is open, so others could run their own repos targeting other platforms.

I don't recommend that maintainers upload Arm wheels themselves; instead, let us build them, knowing they work on the Pi.

We also attempt to show library dependencies on our project pages, e.g. https://www.piwheels.org/project/numpy/, rather than leaving people to work them out themselves, e.g. https://blog.piwheels.org/how-to-work-out-the-missing-dependencies-for-a-python-package/

Hi @scopatz, what do you propose to do about shared libraries that have no natural place in a wheel (to me, most shared libraries have no natural place in a wheel)?

We cannot stick our heads in the sand on that. Our heavy use of shared libraries in conda is one of our most compelling advantages, and because we use the same ones across languages, putting those shared libraries in a wheel would be a bad thing to do.

I'm not coming with a solution here. I wish I were, I really do.

It would probably be neater not to ship the external libraries in the wheels, but it has in practice been working, at least on Linux and macOS.

I can see the problem is more urgent for Conda, because y'all are building a multi-language software distribution.

A few years ago, @njsmith wrote a spec for pip-installable libraries: pypa/wheel-builders#2

It isn't merged, and it looks like 'the current setup works for me' has meant that no-one thus far has had the time or energy to work further on that. I suspect something on that line is the proper solution, if we could muster the time.

By the way - @scopatz - I'm happy to help integrating the wheel builds into conda forge - but I'm crazy busy these days, so I won't have much time for heavy lifting.

It would probably be neater not to ship the external libraries in the wheels, but it has in practice been working, at least on Linux and macOS.

Well, the software needs to work of course and I'm not being facetious!

We end up discussing where the line is between the thing itself and the system libraries that support it, and that's not clear cut. Take xgboost as an example. It has a C/C++ library and bindings for Python and R. Now, xgboost itself builds static libs for each, so they sidestepped that issue, while we're much more efficient (in many dimensions). Now libxgboost is clearly a part of the xgboost stack, but what about ncurses? Is it system or not? In conda-forge, we provide it, and in all honesty that line is organic and something we move as and when we find we need to.

@brainwane @scopatz if there's a better title for this issue today, could you change it, or comment so that someone who can make the change does so?

I can offer mild packaging familiarity, reasonable Python / CI / cloud experience, and say 10-20 hours a week for the next month if it would be helpful. I think I would be a good fit if there's a rough consensus on direction and PyPA/conda experts available for consulting, but the work is bottlenecked on elbow grease.

cc @brettcannon @dstufft @asottile

@matthew-brett I thought Carl Kleffner did something similar (a pip-installed toolchain with OpenBLAS for NumPy), though my memory might be foggy.

@mikofski - right - Carl was working on Mingwpy, which was (still is) a pip-installable GCC compiler chain for building Python extensions that link against the Microsoft Visual C++ runtime library used by Python.org Python.

Work has stalled on that, for a variety of reasons, although I still think it would be enormously useful. I can go into more details - or - @carlkl - do you want to give an update here?

@mattip - because we were discussing this a couple of weeks ago.

I don't know that we have a clear answer on whether pip should be used as a general-purpose packaging solution. My view, which seems to be shared by several others in the recent Discourse discussion about it, is that it should not try to "reinvent the wheel" or replace general-purpose packaging solutions (like conda, yum, apt-get, nix, brew, spack, etc.). pip has a clear use as a packaging tool for developers and "self-integrators".

For that use case, statically linking dependencies into a wheel (vendoring native dependencies) can be a stop-gap measure, but it becomes very difficult for distributors, as evidenced by the PyTorch, RAPIDS, Arrow, and other communities. It is definitely not ideal, and in fact a growing problem with promoting the use of wheels for all Python users.

Using pip to package native libraries is conceivably possible, but a bigger challenge than it seems at first. It is hard to understand the motivation for this considerable work when this problem is already solved by several other open-source and more general-purpose packaging systems.

A better approach in my view is to enable native-library requirements to be satisfied by external packaging systems. In this way, pip can allow other package managers to install native requirements and only install wheels with native requirements if they are already present.

Non-developer end users who use Python integrated with many other libraries (such as the PyData and SciPy users) should also be encouraged to use their distribution package manager to get their software. These distributions (such as conda-forge) already robustly satisfy the need for one-command installation. This is a better user experience than encouraging these particular users to "pip install".

In sum: conda-forge infrastructure producing wheels is a good idea, conda-build recipes producing wheels that allow for conda-packages to satisfy native-library dependencies is an even better idea.

@teoliphant While theoretically a reasonable idea, this ignores the fact that a significant number of users are asking for pip-installable versions of these packages. Ignoring those users, or suggesting that they should "just" switch to another packaging solution, is dismissing a genuine use case without sufficient investigation.

I know from personal experience that there are people who do need such packages but who can't or won't switch to Conda (for example). And on Windows there is no OS-level distribution package manager. How do we serve such users?

What difficulties are pytorch, rapids, arrow having? I'm happy to advise.

For arrow, I think it's best summarized here:

https://twitter.com/wesmckinn/status/1149319821273784323

  • many C++ dependencies
  • several bundled shared libraries
  • some libraries statically linked
  • privately namespaced, bundled version of Boost

@wesm - I'm happy to help with this - let me know if I can. Did you already contact the scikit-build folks? I have the impression they are best for C++ chains. (Sorry, I can't reply on Twitter, have no account).

wesm commented

I believe we have one of the most complex package builds in the whole Python ecosystem. I think TensorFlow or PyTorch might have us beat, but it's close (it's obviously not a competition =D).

I haven't contacted the scikit-build folks yet; if that could help us simplify our Python build, I'm quite interested. I'm personally all out of budget for this after I lost a third or more of my June to build- and package-related issues, so maybe someone else can look into it.

cc @pitrou @xhochy @kszucs @nealrichardson

Thanks - that sounds very tiring. I bet we can use this as a stimulus to improve the tooling. Would you mind making an issue in some sensible place in the Arrow repositories for us to continue the discussion?

I'll echo what @wesm said here. I spent a lot of time as well trying to cope with wheel packaging issues on PyArrow. I'd be much happier if people accepted to settle on conda for distribution and installation of compiled Python packages.

(disclaimer: I used to work for Anaconda but don't anymore. Also I own a very small amount of company shares)

@pitrou - I hear the hope, but I really doubt that's going to happen in the short term. So I still think the best way, for now, is for those of us with some interest and time to try and improve the wheel-building machinery to the point where it is a minimal drain on your development resources.

wesm commented

Just to drop some statistics to indicate the seriousness of this problem, our download numbers are growing to the same magnitude as NumPy and pandas

$ pypistats overall pyarrow
|    category     | percent | downloads  |
|-----------------|--------:|-----------:|
| with_mirrors    |  50.18% |  9,700,974 |
| without_mirrors |  49.82% |  9,630,781 |
| Total           |         | 19,331,755 |

$ pypistats overall numpy
|    category     | percent |  downloads  |
|-----------------|--------:|------------:|
| with_mirrors    |  50.15% | 114,356,740 |
| without_mirrors |  49.85% | 113,661,813 |
| Total           |         | 228,018,553 |

$ pypistats overall pandas
|    category     | percent |  downloads  |
|-----------------|--------:|------------:|
| with_mirrors    |  50.12% |  67,694,077 |
| without_mirrors |  49.88% |  67,358,042 |
| Total           |         | 135,052,119 |

One of the reasons for our complex build environment is that we're solving problems that are very difficult or impossible to solve without a deep dependency stack. So there is no end in sight to our suffering with the current state of wheels.

Did you already contact the scikit-build folks? I have the impression they are best for C++ chains

I believe conda is the best for C++ chains but I would say that.

OpenCV seems to have a similar problem to pyarrow. They have many dependencies and are C++ based. The opencv-python repo builds wheels by using upstream OpenCV as a submodule. It uses scikit-build to build various variants of the package as a single C extension module. Most of the dependencies are statically linked into the shared object. Maybe worth a look if you want to target wheels. The separation between the OpenCV repo and the packaging repo is particularly attractive, since the CI runs and testing can be kept separate.

I believe conda is the best for C++ chains but I would say that.

Yes, sorry, I don't have an informed opinion on Conda and C++ - I only meant 'best of the Wheel meta build tools, if you have C++ chains'.

Why is everyone so keen to fill their computer memory up with multiple copies of the same functions?

We do tend to build out static libs for all of our packages, but the problem is that things need to be rebuilt twice, once using shared libs for conda and once using static libs for wheels.

Problem is you're then at a point where the difference in effort between building a wheel and a conda package may shrink (in the wrong direction).

Why is everyone so keen to fill their computer memory up with multiple copies of the same functions?

I hope it's a sarcastic comment. What many people apparently want, though, is to be able to pip install everything, which might be a legitimate request or not (no comment on that). The Jupyter folks already gave up on this (I'm not familiar with the reasons), and now the extensions must be built using npm, which of course you can't pip install (yet?).

I hope it's a sarcastic comment

It is a rhetorical question, not a sarcastic comment. And it is entirely legitimate.

The thing that is being skirted around here is the fact that wheels are hard to build because they link to static libraries, and static libraries are extremely fiddly for build systems (and humans) to work with.

The work done to build a shared library is reused every time that shared library is loaded. The work done to build something with a static library is repeated with every package.

That work is also horribly complicated: if the same static lib gets linked twice into the same Python extension module, you get symbol name clashes. Often you have to deal with static libs built with one build system being consumed by another, and for that to work they need to understand, to some extent, each other's C-level packaging metadata (so CMake needs to know some things about libtool and pkg-config, for example).

I believe this is the crux of why building packages for conda is easy and building complex packages for PyPI is not and there's not a great amount of tooling that can be done around that problem.

Why is everyone so keen to fill their computer memory up with multiple copies of the same functions?

I don't know whether anyone has numbers on that - I don't - but I'm betting that the extra memory is maybe on the order of 30M in typical PyData usage, which might worry me on my first-generation Raspberry Pi, but not on my Intel laptop.

which might worry me on my first-generation Raspberry Pi, but not on my Intel laptop.

Sorry @matthew-brett, I disagree with this. Software should scale well and be efficient everywhere, otherwise you're killing the planet (without getting too high up on my moral horse!)

Yes, sorry, I'm only saying the 30M doesn't worry me, personally, on the Intel machines that I use; I get that it worries you, and I can see that would have an effect on your choice of packaging tool.

I'd be much happier if people accepted to settle on conda for distribution and installation of compiled Python packages.

I think the question of "conda vs wheels" keeps coming up, but it seems to me that it's a bad case of comparing apples and oranges. My personal problem with switching to conda is that it isn't just a choice of packaging tools - there's a lot of other baggage as well. (Disclaimer: I haven't tried using conda for a while now, although I made a few attempts in the past, so my data may be out of date). For example:

  • Conda manages my Python interpreter, so it's not immediately obvious to me how I'd use a new python.org release (or a beta, or a personal build).
  • Conda has its own virtual environment solution, meaning I don't know how it interacts with things like pipenv, or pew, or pipx (or, for that matter, tox). And if I decide to try it anyway, I'm not at all clear who I'd look to for support if something didn't work as expected.
  • Conda builds are handled by people other than the upstream projects, so if the conda build people haven't packaged a project, or they haven't packaged the version I want to use, I'm back to needing "another solution" (at which point we're back to does pip integrate with conda, and if I have to use pip for some of my dependencies, why can't I just use pip for everything).

If conda offered just a package management solution, it would be much easier to argue that people wanting wheels for hard-to-package projects should just use conda. But when a switch to conda involves such a significant change to the user's whole working environment, it's a much harder sell (and it's much more important to take the comments of people saying "I can't use conda for my situation" seriously).

Further disclaimer: I am a pip maintainer, so the above is not unbiased. But I have heard similar comments from a colleague who has no particular reason to prefer either option.

wesm commented

I agree with @msarahan -- the problem is that we are completely on our own to manage a standalone library toolchain that has to be shipped in a wheel using a mix of static linking, bundling shared libraries, and private namespacing.

The configuration of the wheel build is highly bespoke because of the requirement that the wheel be self-contained and as isolated as possible from the user's system. As an example, because we are now shipping gRPC symbols in our wheels (for our Arrow Flight messaging system), we now need to bundle or statically link OpenSSL to provide TLS support to users. I have no idea what will happen when users are using Google's Python gRPC wheels in the same process.

We are also statically linking:

  • LLVM
  • Snappy
  • bz2
  • lz4
  • zstd
  • Protocol Buffers (dependency of gRPC)
  • c-ares (dependency of gRPC)
  • Apache Thrift C++
  • uriparser
  • re2
  • glog
  • gflags
  • double-conversion
  • Brotli

When I tell people about the dependency stack, the knee-jerk reaction is "But do you really need $LIBRARY...?" We aren't adding these dependencies frivolously -- they are used to solve real world problems.

Just building our wheels and getting them to work everywhere is hard enough, but we've also had to contend with the non-compliant wheels from PyTorch and TensorFlow, which result in numerous inbound bug reports.

Were it not for our generous corporate sponsorships, keeping this project going from an operational standpoint probably would not be possible. But we are spending a disproportionate amount of time maintaining packages (wheels being far and away the worst offender), which puts said sponsorships at risk in the medium term if we aren't able to spend more time focused on building new features.

@wesm - I absolutely see the problem - and I hope very much we can remove a large proportion of that pain from your build process, by some collaboration on improving tooling. I'm not saying that will definitely work, only that we should definitely try.

That said, you do have a problem of nearly unique complexity (competing in this respect with TensorFlow). It is true that, if all packages were in your situation, the current vendor-your-libraries approach in wheels would make wheels in general unworkable.

On the other hand, wheels do work, with a small and acceptable maintenance burden, in many cases.

Those of us who want to continue to use pip have a strong motivation for trying to help you out of the hole that the current system has put you in.

I think the question of "conda vs wheels" keeps coming up, but it seems to me that it's a bad case of comparing apples and oranges.

I don't want to be insulting, but it's rather a comparison of good apples and rotten oranges.
The people complaining are people who have been forced into a long, painful experience building wheels. The people saying everything is fine are people who seem to only be building simple, trivial packages.

Personally, I am extremely annoyed by the self-complacency of the so-called Python Packaging "Authority". You want to call yourself an Authority but you don't seem to have the Competence that goes with it. That's a fundamental issue here. @ncoghlan

The people saying everything is fine are people who seem to only be building simple, trivial packages.

Building simple, trivial packages, is pretty simple, and trivial.

Building moderately complicated packages, like SciPy, Matplotlib, and Pillow, is really not too bad.

Building complicated packages like VTK and ITK was hard work. I didn't help with those, and I don't know how much of an ongoing burden that is.

I think we do all agree that Arrow and Tensorflow are at the very extreme end of difficulty. So we have a fairly typical situation where a tool works pretty well for the large majority of packages, but is very hard to use for the most difficult packages.

But conda works equally well or better in all cases... so? Why do we have to cope with an inferior standard? Just so that the wheel designers don't lose face?

wesm commented

We certainly have the option to give up on wheels (and at this point, I would say good riddance). The problem is that the users will accuse us of being lazy instead of asking whether wheels are the right place to deploy the software.

Exactly, that's the problem. And that's why I think "Campaign to get people Publishing Wheels" is a nuisance to packagers :-/

@pitrou - you're saying "please give up using pip and use conda instead", but I honestly think there isn't any appetite for that discussion here. Happy to be corrected, but if so, let's move that off to a different place. Do you want to do that? I'm happy to join that, wherever it may be. Assuming / hoping that we aren't having that discussion, then the question is, what can we do to make Arrow's build process practical. I can well see that using conda-forge's static libraries could be a good solution.

Happy to be corrected, but if so, let's move that off to a different place. Do you want to do that?

There is no need for a discussion. All the arguments have been given. The remaining TODO item is for the Python Packaging "Authority" to stop recommending building wheels (see issue title) and recommend conda instead.

wesm commented

The ideal solution, honestly, is to make pip work like conda with respect to C/C++ shared library dependencies. Rather than making us vendor / bundle / statically-link things, manage the dependencies with the package manager. And preferably spread out the burden of maintaining the builds of those dependencies. Exactly what conda-forge is now.

If the solution were wholly endogenous to the PyPA, would it be easier to accept?

I don't know if this can happen though. I hope you would consider it, though

There is no need for a discussion.

That's fine, honestly, and I am glad that it has all been explained.

I don't know if this can happen though. I hope you would consider it, though.

I think this is an option, and it has been discussed - you've probably seen the links further up, that I posted. The problem is that, for various reasons, the mechanics of doing that stalled at the proposal. I think one of the reasons is that, so far, very few packages have really needed that functionality. You might well be such a package - so then there is just the work of agreeing that the problem can't be solved adequately in another way, and getting down to doing the work.

wesm commented

I would hesitate strongly to put conda-forge beneath the PyPA umbrella, because it is conceptually larger than Python. It's definitely made up of more Python-centric people right now, but the community also spans R and Perl, and need not be limited to any language. I think conda-forge would be happy to have PyPA help advise and guide based on the needs of the Python community, though.

I'm not suggesting we do that. I'm suggesting having a sort of smaller "pip-forge" that otherwise serves the same semantic purpose as conda / conda-forge. The PyPA would need to create tooling to enable non-Python packages to be pushed to PyPI and declared as build and runtime dependencies of Python wheels.

I'm as wary as anyone about the potential governance issues with conda (see, e.g. https://wesmckinney.com/blog/conda-forge-centos-moment/), but we have to pick our battles.

but we have to pick our battles.

Yes, that's true - and the arguments you're making are one reason that I care a lot about making sure that Pip is a viable alternative to Continuum / Anaconda / Conda.

What I was proposing was that, on the build side, conda's foundation can simplify the build process, ultimately leading to wheels that people install without knowledge of them being produced by conda recipes.

That certainly sounds like something worth exploring. IIUC, it's similar to what @scopatz was suggesting when this thread got restarted a few days ago.

If the solution were wholly endogenous to the PyPA, would it be easier to accept?

Personally, I would be happy to accept an alternative to wheels (regardless of where it comes from) if there was community consensus that it was better. But it's difficult to see what's being proposed at the moment - no-one has addressed the (entirely legitimate, IMO) questions about how to "use conda" for package management without buying into the whole conda ecosystem. And I haven't seen another proposal yet.

Personally, I am extremely annoyed by the self-complacency of the so-called Python Packaging "Authority".

First of all, as far as I am aware, the "Authority" in the name was originally intended as a somewhat ironic overstatement. These days it seems to be giving the wrong impression, and I'd be quite happy to change it (if it were my decision to make). But I doubt it's the real issue here.

As far as I know, no-one is dismissing the genuine difficulties with wheels as a binary distribution format in complex cases. But as has been noted, it does provide a good solution for a majority of packages. We're trying to work on the difficult areas, but we're limited in resources. And to my knowledge no-one has suggested any viable alternative. If you consider "use conda" to be an alternative, then why doesn't a significantly higher proportion of Python users use conda already? It's certainly an option if people want it. Are you seriously suggesting that the PyPA wields that much power? I wish...

Anyway, I'm not helping this discussion, so I'll stop participating at this point. Please continue without me.

The remaining TODO item is for the Python Packaging "Authority" to stop recommending building wheels (see issue title) and recommend conda instead.

We're never going to recommend conda over pip + wheels as long as conda can't install into arbitrary target Pythons on arbitrary systems.

If some package can't be built as a wheel for some reason, or the effort to do so is so high that it's not worth it, then it is perfectly fine for that project to just not publish wheels. That project is also free to tell people that they have packages through whatever distribution mechanism they wish to use, conda included.

The entire point of making building wheels a recommendation, instead of a requirement, is specifically so that in edge cases where building wheels is impossible or impractical for one reason or another, those projects are not somehow at a disadvantage (other than the "natural" one whereby the more complicated it is to install your project, the more people are going to complain and run into issues trying to install it).

If the solution were wholly endogenous to the PyPA, would it be easier to accept?

I don't think we care about who writes or "owns" the solution. More important is ensuring that the solution doesn't tie us to one particular ecosystem. For instance, a solution that only works with apt-get and bakes in the assumption that apt-get is going to exist and be there would not be acceptable, nor would one that mandated that end users use conda, or yum, or MSI installers, or Homebrew, or anything else.

Off the top of my head I can only think of two real solutions:

  • Continue to require that wheels remain self-contained aside from the handful of system-provided libraries.
  • Implement some form of metadata that allows a project to declare a dependency on an "external" dependency, which pip won't attempt to satisfy itself and will either simply provide an advisory message or directly invoke some other tool to do the install.

If there's another solution here that doesn't boil down to "lock all of the users into one platform" then I think we'd be happy to discuss that as well.

Of course with any solution here we can leverage other improvements regardless of what solution we pick. An example here is that regardless of solution, we can probably leverage conda-forge to provide the actual mechanics of producing builds and publishing them to PyPI (opt in for each project of course).

Ultimately, what most of the ways forward lack is someone willing to pick a solution, advocate for it, get the relevant stakeholders on board, implement it, and drive it to completion (of course those don't all have to be the same person). Without that, it's mostly going to be people vaguely talking past each other.

First of all, as far as I am aware, the "Authority" in the name was originally intended as a somewhat ironic overstatement.

I don't think that was ever obvious to anyone but the people who devised that name. For better or worse, calling yourself an Authority means that people will think you are an official authority.

I don't think that was ever obvious to anyone but the people who devised that name

Quite possibly. I never said it was a good joke/idea.

wesm commented

I wasn't aware that it was a joke.

I'm not sure that anyone who originally made the joke is still even active in the packaging tooling. I know it certainly predates my time, and from Paul's comments it sounds like he wasn't involved in that and I think we're two of the oldest still active pip developers?

Well, if you call it "blessed packaging and distribution toolchain", it doesn't change much in reality. It's still authoritative...

Unfortunately, conda-forge does not always build static libraries, and it is not always easy to obtain static library builds of every project. The proposal discussed at SciPy was to vendor shared libraries from conda packages to create the wheel, not to supply static libraries at build time for the wheel. Clearly, static libraries will make much more size-efficient wheels, and definitely simplify any of the questions about symbol mangling, but it means that conda-forge will effectively need to maintain a complete stack of both dynamic and static libraries.

I'm skeptical that static libraries will be much more space efficient in most cases. My understanding is that they're only more space efficient when the vendored project is carefully designed to split up their symbols into separate files, and your package is only using a subset of the project's functionality. That might well be true in some important special cases like LLVM, but I doubt it's true in the average case. This seems like something to worry about later, after things are basically working.

On Linux, auditwheel pretty much solves the technical problems in vendoring DSOs. If we want to move this forward, then I think the first priority is to extend auditwheel to handle macOS and (most of all) Windows.

The ideal solution, honestly, is to make pip work like conda with respect to C/C++ shared library dependencies. Rather than making us vendor / bundle / statically-link things, manage the dependencies with the package manager. And preferably spread out the burden of maintaining the builds of those dependencies. Exactly what conda-forge is now.

This is technically doable without any changes in pip or even talking to the PyPA at all - that's what the "pynativelib" proposal is about. Of course, putting together a sustainable ecosystem of packages is a ton of work, as the conda-forge folks know, and so far no one has been interested in doing the initial work to enable it.

But, is your problem actually with shipping wheels that have vendored libraries in them, or is it with maintaining the system to build those vendored libraries? If there was a tool that took your conda package, bundled it up with all the libraries it depends on from other conda packages, and automatically spit out a self-contained wheel suitable for upload to pypi, would that solve your problem?

wesm commented

But, is your problem actually with shipping wheels that have vendored libraries in them, or is it with maintaining the system to build those vendored libraries?

Both areas have caused us a great deal of hardship.

Clearly, static libraries will make much more size-efficient wheels

Can you explain this? Every statically linked wheel will be bigger; the code for e.g. std::deque::pop_front will exist, literally, as code in every wheel that uses C++ (modulo the quality of your strip tool and/or linker). This is the inefficiency that conda solves at its core, and it solves it for all the languages and all of the OSes at once.

alex commented

You're going to get std::deque::pop_front in every wheel no matter what -- STL data structures are header only.

Hah, that's a good point, but substitute it for something that's not in headers.

Autoloaded, vendored DSOs only get loaded once per SONAME. That is a very important point. If you vendor the same DSO twice, guess what happens when you switch import order (some say those should be sorted alphabetically!).

@mingwandroid That's why auditwheel rewrites vendored DSOs to give them unique SONAMEs on Linux, and a hypothetical future Windows auditwheel would need to do the same. (macOS works a bit differently and SONAME collisions aren't a thing there.)

The effect is that within a single wheel, you can have multiple extensions that all use the same vendored DSO, but different wheels don't interfere with each other. (And actually, auditwheel's soname mangling algorithm uses a hash of the DSO, so it's possible for two wheels to share the same in-memory DSO if they both vendored exactly the same DSO. Probably this happens fairly often for, say, numpy and scipy both vendoring the same build of openblas. But this is a minor, best-effort optimization.)
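
For the curious, here is a rough sketch of that idea (not auditwheel's actual code; paths and names are illustrative): derive a suffix from a hash of the library's contents, give the vendored copy a unique SONAME, and repoint the extension modules that link against it, using patchelf.

# Rough sketch of hash-based SONAME mangling; not auditwheel's actual code.
# Assumes patchelf is on PATH; paths and names are illustrative.
import hashlib
import shutil
import subprocess
from pathlib import Path

def mangle_vendored_dso(dso_path: Path, consumers: list[Path]) -> Path:
    digest = hashlib.sha256(dso_path.read_bytes()).hexdigest()[:8]
    old_name = dso_path.name                                  # e.g. libfoo.so.1
    new_name = old_name.replace(".so", f"-{digest}.so", 1)    # e.g. libfoo-1a2b3c4d.so.1
    new_path = dso_path.with_name(new_name)
    shutil.copy2(dso_path, new_path)
    # Give the vendored copy a unique SONAME derived from its contents...
    subprocess.run(["patchelf", "--set-soname", new_name, str(new_path)], check=True)
    # ...and repoint every extension module at the renamed copy.
    for ext in consumers:
        subprocess.run(["patchelf", "--replace-needed", old_name, new_name, str(ext)],
                       check=True)
    return new_path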

The fundamental problem is that what most people want is something more like conda (i.e. a general-purpose packaging solution), rather than a language-specific packaging solution.

I don't think that's true. I think a huge majority of Python users only expect their package tool to install Python packages.

I do teach R, and I think I'm right in saying that I've never seen an R person recommend Conda to install R packages.

That's part of the root of this problem - any packaging solution that aims to be that general will need a powerful organization to support it. It can't be just Python programmers or just R programmers or just Linux users or just Windows users. In practice, that level of integration and that amount of work will need a large amount of money and staff, and that in turn will raise the kind of governance issues that Wes referred to.

@njsmith - you mentioned the need for a macOS version of Auditwheel - is that because the tool we're currently using - delocate - does not do what we need? What does it lack?

@matthew-brett Mostly I just think that if our goal is to leverage conda-forge to get a uniform cross-platform workflow for wheel building, then as part of that we'll want a uniform cross-platform workflow for handling vendoring. I said auditwheel because I'm more familiar with it and the name is more generic, but I'm not attached to that name in particular, I just think it's confusing to ask maintainers to learn multiple tools to deal with different platforms.

I think at this point we've long reached the part where we're just a circular firing squad firing shots past one another.

For folks who are engaging this in good faith AND who are looking for something actionable to do to make this situation better, here is what I would suggest you do:

Pick your pain point, figure out which solution to that pain point you personally like best, keeping in mind that while incremental changes aren't the only ones possible, that the more radical a change and the larger the amount of work required, the more likely it is to stall.

If it's something like pynativelibs or a conda2wheel tool, you don't require any special blessing from anyone, and you can just go ahead and start working on it, feel free to ask for feedback or help but don't expect it to become the recommended thing until there has at least been something working and people seem to generally like the idea.

If it's something like external requires that get auto installed or an advisory message or something, then your first step is going to be producing a PEP that details the idea, goes into the tradeoffs, what other solutions you thought of and rejected, etc. You'll have to champion and drive discussion of your PEP, hopefully to acceptance.

I personally would say that pretty much anything is on the table for changes we can make, even fairly radical changes, but we also have to keep in mind that our use cases are oftentimes a lot wider than what a lot of other tools (at least in some dimensions) have to deal with (for instance, Windows/Mac/Linux is pretty typical, but what about FreeBSD, OpenBSD, QNX, AIX? We get install traffic from those OSes, so that's not entirely hypothetical).

Basically, the sky's the limit if you can come up with a detailed proposal that has a compelling enough argument for why the trade offs make sense and the willingness to put in the effort to make it happen once there's broad agreement.

@dstufft I have two ideas aimed at preventing post-release (or post-PyPI-upload) surprise "cannot load shared library" issues:

  • a quite easy one: an auditwheel validate command which exits with a nonzero code, rather than just printing the problems it finds.

  • a bit more complicated: I'd like to test the (somehow) produced wheels in a mostly standardized environment on Windows, multiple Linux distros, and macOS. This testing should happen before uploading possibly deficient libraries.

One of CF's big advantages is that the produced artifacts are tested before being uploaded anywhere, in most cases in the simplest manner: just import mypkg, which would already help to catch a majority of linking issues.
I'm not expecting a fully fledged and hosted infrastructure where I can upload my wheels; in the first round, a git repository template would be enough, containing CI configurations for most of the platforms. I'd be willing to fork it, configure all of the CIs, and keep it updated with the upstream changes, in exchange for an environment where the wheels are supposed to work.
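
To make the second idea concrete, here is a minimal sketch of such a pre-upload smoke test (the wheel path and import name are placeholders): install the freshly built wheel into a throwaway virtual environment and fail with a nonzero exit code if the package can't even be imported.

# Minimal sketch of a pre-upload smoke test; the wheel path and import name
# below are placeholders. Creates a throwaway venv, installs the wheel into
# it, and exits nonzero if the install or the import fails.
import os
import subprocess
import sys
import tempfile

def smoke_test(wheel_path: str, import_name: str) -> None:
    with tempfile.TemporaryDirectory() as env_dir:
        subprocess.run([sys.executable, "-m", "venv", env_dir], check=True)
        bindir = "Scripts" if os.name == "nt" else "bin"
        python = os.path.join(env_dir, bindir, "python")
        subprocess.run([python, "-m", "pip", "install", wheel_path], check=True)
        subprocess.run([python, "-c", f"import {import_name}"], check=True)

if __name__ == "__main__":
    smoke_test("dist/mypkg-1.0-cp38-cp38-manylinux1_x86_64.whl", "mypkg")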

wesm commented

I just wrote to the Apache Arrow mailing list indicating my / my team's intention to disengage from further wheel maintenance for the project -- maybe other volunteers can get involved to help

https://lists.apache.org/thread.html/128a2bec285ad45aa4189ebb39a15b39dcf6d91c4ab0278ff4f7cdea@%3Cdev.arrow.apache.org%3E

(you have to be subscribed to this list to post to it)

Fixing the architectural problems with wheels and the tooling around them is best taken care of by other people. I have corporate sponsors that I am accountable to and "I spent all my time fixing Python packaging instead of building new Arrow features" is not something I can say to them and expect to continue having funding.

I hope that things get better in the future. For the sake of the Python ecosystem and the people in it.

Since RAPIDS was mentioned here: I'm one of the maintainers of the RAPIDS libraries.

I would just like to echo the same things that @wesm and @pitrou expressed. Additionally, instead of a single library, RAPIDS has multiple C++ libraries that depend on one another, as well as on Arrow and numerous CUDA libs. This makes non-compliant wheel packaging extremely difficult and compliant wheel packaging essentially impossible, and it prompted us to just abandon supporting wheels for the foreseeable future. We aren't the only GPU project with these issues, but many of them have chosen to just produce non-compliant wheels, since they don't want to statically link or ship the CUDA libs.

We have had a relatively painless time supporting conda packages and would love to support the non-conda users in the Python ecosystem, but similarly to Arrow, it's nearly impossible to justify the effort of going back to try to maintain wheel packaging.

NOTE: I am an NVIDIA employee

wesm commented

Thanks @kkraus14 -- Arrow actually has a libarrow_cuda.so optional library and pyarrow.cuda optional extension that we aren't building in the wheels on account of these issues.

I will repeat this idea again to see if anyone is willing to do this (I can't personally do this work right now --- but I can pay someone to do it via whatever contract is needed --- please let me know). I would sponsor writing and implementing a PEP that promotes something like the following.

Python packages should be able to easily specify their non-python dependencies. These are files that will not be installed by pip but which must be available for a "pip install" to proceed. These files could be specified in a way that is cross-platform or platform specific. Then, there must be a way to configure pip to call out to another command to install these dependencies.

At that point, someone just starting from python.org could have their pip configured by default to use some of the solutions being described in this thread.
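
To make the shape of this concrete, here is a purely hypothetical sketch (none of this metadata or configuration exists in pip today): a package declares native requirements, and the installer checks for them and defers to a configured external command rather than trying to satisfy them itself.

# Purely hypothetical sketch: none of this metadata or configuration exists
# in pip today. A package declares native requirements; the installer checks
# for them and defers to a configured external installer command.
import ctypes.util
import shutil
import subprocess

EXTERNAL_REQUIRES = ["zstd", "grpc"]                 # hypothetical package metadata
EXTERNAL_INSTALLER = ["conda", "install", "--yes"]   # hypothetical pip configuration

def ensure_external(requirements):
    missing = [name for name in requirements
               if ctypes.util.find_library(name) is None]
    if not missing:
        return
    if shutil.which(EXTERNAL_INSTALLER[0]):
        subprocess.run(EXTERNAL_INSTALLER + missing, check=True)
    else:
        raise SystemExit(f"native requirements not satisfied: {missing}; "
                         "install them with your preferred package manager.")

ensure_external(EXTERNAL_REQUIRES)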

If such a PEP were accepted and implemented, I believe we would have a real road forward that does not create conflicts between people who get their Python from different sources. In the current situation we do have that very real and currently unresolvable conflict.

Some Python users are very happy with pip + wheels while others (like me) who have no problem with pip are very unhappy with widely promoted pip + wheels breaking their systems and some very useful package authors struggling to actually provide wheels. I do believe this conflict is currently unresolvable without some real changes that allow for more compatibility between pip and other (system) package managers. I would be happy to understand and fund alternative proposals that could also achieve these ends.

(Full disclosure, I do own shares of Anaconda but I do not work for Anaconda. I honestly do not think this is the reason for my recommendations which actually stem from many years of experience building and distributing software --- starting with the SciPy library. In fact, I honestly think my Anaconda stock does better if things stay as they are because the current situation only drives more people to the holistic solution that Anaconda provides).

Let me preface this by saying I think it's perfectly fine for Arrow or RAPIDs or literally anything else to stop producing wheels. It's no different than these projects deciding not to support any other feature of Python or really anything.

While you're accountable to corporate sponsors, almost all of the packaging work is done entirely by volunteers or, lately, by people writing grants to get specific targeted proposals funded. This means that almost all of our resources are sporadic by nature AND we cannot direct these people what to work on; people will work on what interests them, what solves a problem they have, or what they think is important. There is not likely going to be a plan in place until one of the affected parties (and I might point out that a fair number of the affected projects have heavy corporate involvement or are directly sponsored, and so have far more resources than the entire PyPA set of tools and standards does) steps up and puts in effort to make it happen.

In Wes's email to the Arrow list there is a complaint that the PyPA-derived tools only work for 95% of packages, and it then points to Conda as if it works in every situation -- it doesn't. There is a lot of overlap between situations where one could choose conda and where one could choose pip, and both would work perfectly fine. Then there are areas where the design of pip and the PyPA tooling makes it difficult to support some particular use case, and likewise there are areas where the design of conda and Anaconda makes it difficult to support some other use case.

It sucks that Arrow, RAPIDs, etc are hitting a case where the PyPA tooling doesn't have a great story for, and if Conda solves those use cases better for you, then I fully support you recommending and even exclusively producing packages for conda.

What isn't OK is trying to allude to the fact that we're somehow negligent because we simply don't have the resources to support some use case that happens to affect you in particular. Quite frankly, posts like that are extremely demoralizing and IMO downright toxic to the OSS community as a whole and directly lead to burnout of maintainers.

With regards to @teoliphant's latest post-- I do hope somebody takes him up on it (or does something similar). I personally have more pressing issues (and I'm already salaried, so it's hard for me to convert raw $ into more time to work on Python's packaging) and I'm not willing or able to sacrifice even more of my non-programming free time for a problem that quite frankly, doesn't affect me.

I'm going to disengage from commenting on this issue now, but I'm going to continue to monitor it, and I'm going to be more proactive in moderating non-constructive comments from either "side".

Python packages should be able to easily specify their non-python dependencies. These are files that will not be installed by pip but which must be available for a "pip install" to proceed. These files could be specified in a way that is cross-platform or platform specific. Then, there must be a way to configure pip to call out to another command to install these dependencies.

IMO, realistically, the only platform that it makes sense to support like this is conda. The hardest target for wheels right now is Windows, and AFAIK on Windows conda is the only relevant package manager. (Yeah, there's chocolatey and stuff like that, but I don't think it's going to give you hundreds of niche scientific libraries?) On macOS I guess there's also homebrew, but I haven't heard anyone wishing they could make homebrew-only wheels. And on Linux there's too much diversity between package managers -- again, I haven't heard anyone saying "I really wish I had to build 10 different wheels to target 10 different distros". OTOH conda is available on all three platforms in a pretty uniform way, and already really popular with Python users -- and in particular, the Python users targeted by the projects that struggle the most with wheels.

I'm not sure "call out to another command" is enough to provide useful integration -- the whole reason conda and pip don't get along is that doing dependency resolution correctly requires a global view of everything that's installed, and if you have two different package managers that each only see half the packages, there's just no way to make that work.

So if someone wants to go this route, I think the ideal steps would be:

  • Define a wheel platform tag that means "this wheel targets conda environments specifically"
  • Define some extra metadata to say "this wheel depends on these conda packages", that's only legal inside wheels that have the conda platform tag
  • Teach conda how to install these wheels and feed that metadata into its dependency resolver

Let me preface this by saying I think it's perfectly fine for Arrow or RAPIDs or literally anything else to stop producing wheels.

But then you should acknowledge this publicly and document that conda is a more general alternative to pip for complex use cases. Not necessarily say "pip is obsolete, use conda instead", but at least stop promoting pip as the one official Python packaging tool.

What isn't OK is trying to allude to the fact that we're somehow negligent because we simply don't have the resources to support some use case that happens to affect you in particular.

What? We are not asking you to solve those problems inside the pip and wheel paradigm. For all concerns, you (the PyPA) are the ones insisting that pip is the one official Python package manager, and therefore implying it should handle those use cases as well.

What we are asking you is for the PyPA's official stance to start documenting and promoting conda as a better alternative for complex situations involving native libraries etc.

Can we have a yes or no answer to that question?

but at least stop promoting pip as the one official Python packaging tool.

I think this would require a new PEP after PEP 453.

For all concerns, you (the PyPA) are the ones insisting that pip is the one official Python package manager, and therefore implying it should handle those use cases as well.

pip handles sdist fine, wheels are certainly not mandatory.

What we are asking you is for the PyPA's official stance to start documenting and promoting conda as a better alternative for complex situations involving native libraries etc.

PR to https://github.com/pypa/packaging.python.org are welcome if you think that https://packaging.python.org/guides/installing-scientific-packages/#the-conda-cross-platform-package-manager should be improved.

a bit more complicated: I'd like to test the (somehow) produced wheels in a mostly standardized environment on Windows, multiple Linux distros, and macOS

We already do that in the Multibuild framework, at least for manylinux and macOS. The framework builds the wheel, then installs it and tests it, and the manylinux wheel gets tested in a different container / distribution from the manylinux build container.

Let me preface this by saying I think it's perfectly fine for Arrow or RAPIDs or literally anything else to stop producing wheels. It's no different than these projects deciding not to support any other feature of Python or really anything.

Here I'm claiming to be one of the people trying to find a constructive path forward. With those claimed credentials, I point to this phrase from Wes' email (he links to it further up):

It seems clear to me that the self-appointed Python Package "Authority" is not acting in our best interests, and seems to have adopted the position that it's acceptable to have a language-specific binary packaging system that works well for 95% of use cases but causes unbounded punishment for a small percentage of packages.

That may be too harsh a statement of the problem, but it's not too much of a stretch from comments here and elsewhere.

I think we have to accept that those of us who care about the viability of Python.org / Homebrew Python, and the commercial / not-commercial balance in Python, have to care about the building of difficult wheels. I don't mean that we all have to do something, but I think we do have to be very careful to avoid statements that imply that we don't care about the problem.

@njsmith - yes, the auditwheel / delocate situation isn't very clean. You probably remember, but auditwheel started life as a partial fork of delocate - see git log --stat 1feb5f.

@matthew-brett, @mikofski, thanks for bringing mingwpy to attention again #25 (comment).

The status is: I'm still working on mingwpy, but without much priority due to tight time constraints. However, I'm not sure there's so much need for mingwpy anymore. mingwpy's Fortran support could also be replaced by flang sooner or later; see isuruf/flang#129. Of course, it would be nice to have a pure OSS compiler toolchain as an alternative to VS2017.

Reading this discussion, another aspect comes to my mind: the original idea was to make the installation and usage of the toolchain as simple as possible. Therefore the complete compiler toolchain was packed into a wheel: https://anaconda.org/carlkl/mingwpy. But is this really the preferred solution?

pip will install complex packages from sdists just fine, as long as the target environment is appropriately set up for it.

With proper dynamic build dependencies now in place, this means that instead of shipping complex wheels, projects can instead ship sdists, with a build dependency that knows how to interrogate the target environment for compatibility, and error out saying "Use conda, or another supported platform".

There are plenty of sdist only packages on PyPI (especially for Linux).

Now, if folks react to that by saying "but we have to ship wheels, too many of our users don't have a compiler, and don't use conda", then it isn't clear how "conda only" could be viable when conda+sdist isn't.
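
As a rough sketch of the kind of build-time interrogation described above (the library and project names are illustrative, and a real check would look for headers and pkg-config data rather than just the runtime library):

# Rough sketch of an sdist whose setup.py checks the target environment and
# errors out with a clear message; "arrow" and the project name are illustrative.
import ctypes.util
from setuptools import Extension, setup

if ctypes.util.find_library("arrow") is None:
    raise SystemExit(
        "libarrow was not found on this system. Install it with conda "
        "(or your platform's package manager) before building from source."
    )

setup(
    name="example-bindings",
    version="0.1.0",
    ext_modules=[
        Extension("example_bindings._lib",
                  sources=["src/_lib.c"],
                  libraries=["arrow"]),
    ],
)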

wesm commented

Nick -- in our case, very few target environments will be able to build the pyarrow package from an sdist. There is almost no point in even publishing the tarball. Even bleeding-edge Linux distributions don't have new enough releases of certain packages in some cases (like gRPC). The most viable platform for sdist is macOS because of Homebrew, but Linux and Windows are DOA in practice.

Folks that haven't seen it before may also find https://www.curiousefficiency.org/posts/2016/09/python-packaging-ecosystem.html#platform-management-or-plugin-management of interest.

Playing nice with non-conda platforms as a conda-centric project involves publishing a self-building sdist more so than it involves publishing pre-built wheels - publishing a wheel is certainly nicer (since it makes the project easier to use in more "bring your own Python" contexts), but full-fledged system integrators will be building from the sdist independently of whether a project publishes a wheel archive or not.

That "pre-built only for conda, source builds for everyone else" is an entirely reasonable position for publishers to take.

We can then work with the conda-forge folks to see what can be done to produce wheels as a side effect of conda builds, rather than something publishers need to invest significant extra time into.

@wesm The folks building the Homebrew/Nix/Arch/etc packages still need to get the source from somewhere, and sdist is a better format for that than a conda recipe.

There's still an ecosystem level need for static declarations of external dependencies, but build toolchains are better able to provide interim workarounds for that lack than pre-built binary packages are.

That "pre-built only for conda, source builds for everyone else" is an entirely reasonable position for publishers to take.

I see the attraction of this position - it allows complete disengagement from the packaging problem. But if most developers took that seriously, it would make pip irrelevant for the exact set of users that is causing such big growth in Python at the moment.

I guess you'd say 'Let them eat Conda', but the consequence is that pip becomes an afterthought in the scientific Python / data science world, and Python changes ownership from Python.org, Homebrew, and Linux distribution installs, to an ecosystem centred on Anaconda, the company. I think that's not what we want.

@msarahan - I had skim-read Wes' post, but now I've read it more slowly; thanks for the reminder.

I don't know the Conda world very well, but of course I do know about Conda Forge.

In practice, I think a very large number of people use Conda because of the Anaconda distribution. As far as I was aware, please correct me if I'm wrong, the Anaconda channel is still the default in Conda. This gives Anaconda, the company, a big influence on packaging.

Just for example, imagine that one of Anaconda's big customers wanted a feature in Pandas that the community would not accept. As things stand, I believe it would be technically easy for Anaconda to patch Pandas and make that the default install, thus overriding the developers. Is that not so?

How well do you think the Conda ecosystem would do if Anaconda (the company) disappeared tomorrow? Are you saying that a full install of Python data packages on Conda Forge does not substantially depend on work provided by developers paid by Anaconda?

If you were to tell me that now, or soon, most people installing via Conda do not depend on any decisions made by Anaconda, the company, then that would certainly help. As you say, governance is crucial - we need to be sure that the community can guide packaging in a transparent way that balances the needs of players with large amounts of money, and those without.

And we need to think about the role of Conda in relation to the various other Pythons, such as Python.org Python, Homebrew, apt-get / yum etc installed Python.

I don't have anything against depending on Conda Forge for Wheel packaging work - I am sure that would be very useful.

I don't know what "the default install" means here. But let's assume that many PostgreSQL installs use the official RedHat packages. Does that mean PostgreSQL is owned by RedHat? So, if many people install Pandas using Anaconda, does that mean Pandas is owned by Anaconda?