MeteoSwiss-APN/mch-python-blueprint

Requirements, environments and conda

Closed this issue · 6 comments

Last week, @cosunae, @regDaniel and I had a brief discussion on how to organize unpinned requirements and pinned environment specifications to produce reproducible conda environments, with the following results:

  • Unpinned requirements are currently specified in the files requirements.in and dev-requirements.in, which are plain text files that can be passed to conda create -n <name> --file <file>.
  • Conda environments are subsequently exported to the files environment.yml and dev-environment.yml, which contain additional information such as channels and can be passed to conda env create -f <file>.
  • In principle, environment.yml could be dropped for simplicity, provided installing the dev dependencies during deployment causes negligible overhead. While this would mean one fewer file to maintain and test, we decided to keep both *environment.yml files for the time being.

A question that came up was whether to also use YAML instead of plain TXT files to specify the unpinned requirements. Given that our understanding at the time was that YAML is only used for pinned environments, we decided to stick with TXT files.

Since the meeting, I've conducted some more research on these topics, which was very illuminating and cleared up a lot of points that were not totally clear until now.

My main conclusions:

  • YAML is, after all, a better choice than plain TXT (*.in) files for unpinned requirements files as well (channel specification, pip support).
  • Channel specification is a mess, and using the right channels in the right order for a given project requires some care. (If everyone primarily uses conda-forge, though, this should not be a big issue in practice.)
  • Conda is not totally unaware of pip-installed packages, after all, and with some caution, working with a mix of conda and pip packages should be possible.
  • There is no silver bullet when it comes to truly reproducible conda environments. (Whether this actually matters in practice, given the small number of systems that need to be supported for deployment, is another question.)

What follows is a more detailed write-up on various topics. (It's become way longer than anticipated because I used it to properly sort this all out in my head.)

conda vs. conda-env

Conda environments can be created with either conda create or conda env create. The main difference between the two is that conda create works with TXT files, while conda env create works with YAML files. From what I understand, conda-env started out separate from conda but was integrated as conda env at some point, which is why they are similar yet subtly different (which is a frequent cause of confusion).

conda can export environments as TXT with conda list --export, which outputs all installed packages with pinned version number and build identifier. Furthermore, conda list --explicit outputs the links to all packages, which is the best way to recreate a truly identical environment, albeit only for a given platform and without pip support. To recreate an environment from a TXT file, use conda create -n <name> --file <file>.

conda env can export environments as YAML with conda env export, which outputs the same list of packages as conda list --export, but the YAML format allows for the inclusion of additional information like the environment name and the channels (albeit not necessarily those used to create the environment from a YAML file, as described in the channels section). To recreate an environment from a YAML file, use conda env create -n <name> -f <file>.

A major difference between the two commands/formats is that the YAML files also contain pip-installed packages (and improved interoperability with pip is an area of active development), whereas those are missing from the TXT files.

Note that there is an ongoing effort to unify environment specifications between the conda and conda env (as well as anaconda-project) commands.

A thorough approach combining both commands and pip freeze is described here:

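# On the source machine: export the environment in all three formats
conda env export > environment.yml
conda list --explicit > spec-file.txt
pip freeze > requirements.txt
# On the target machine: recreate the environment from the exported files
conda create --name NEWENV --file spec-file.txt
conda env update --name NEWENV --file environment.yml
pip install -r requirements.txt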

Whether this is actually necessary in an environment with few architectures/machines is another question, but it is at least worth noting here. (Also, it is not entirely clear to me why the pip commands are necessary, given environment.yml should already contain the pip-installed packages.)

Channels

Conda packages are organized in channels. There is a default channel (defaults), but in practice the de facto standard channel is conda-forge (essentially the conda equivalent of PyPI, containing community-provided packages). Channels shouldn't be mixed, so conda-forge should take precedence over defaults (or the latter should be deactivated altogether with nodefaults).

By default, channels are managed globally, usually specified in a user's ~/.condarc. The channels to be used during environment creation with conda env create -f <file> (provided it works) can also be specified in *environment.yml files. However, this channel specification is not transferred into the environment, so subsequent operations like conda install or (notably) conda env export fall back to the user's global channel specification. Therefore, if the channels in a *.yml file differ from a user's global config, care must be taken when creating an updated *.yml file in order to maintain the correct channels.

It is possible, though, to change the conda config for a specific environment, e.g., to set the channels with conda config --env --add channels nodefaults --add channels conda-forge. This writes the config to the environment-specific file ${CONDA_PREFIX}/.condarc, which takes precedence over the user's ~/.condarc when the environment is active. (If ~/.condarc contains channels not in ${CONDA_PREFIX}/.condarc, however, those will still appear after the ones specified in the latter file.)
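A minimal sketch of setting the channels for the active environment. The helper name set_env_channels is hypothetical. Since conda config --add prepends to the channel list, the sketch adds the channels in reverse so that the first argument ends up with the highest priority.

```shell
# Hypothetical helper: set the channels of the currently active environment.
# Arguments are channel names, highest priority first.
set_env_channels() {
    # `conda config --add` prepends, so add the channels in reverse order
    # to make the first argument end up at the top of the list.
    reversed=""
    for channel in "$@"; do
        reversed="$channel $reversed"
    done
    for channel in $reversed; do
        conda config --env --add channels "$channel" || return 1
    done
    # Print the effective channel list for a quick visual check.
    conda config --show channels
}
```

For example, set_env_channels conda-forge my-group-channel (the second name being a placeholder) would put conda-forge first. Channels from ~/.condarc may still show up at the end of the printed list, as noted above.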

If everyone primarily uses conda-forge, this shouldn't matter much in practice. However, if someone uses additional channels (like our group channel), or different projects use different channels (or the same channels in different order), care must be taken during conda env export that the channels used for a project are not accidentally changed.
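One way to guard against accidentally changed channels is to compare the channels in a freshly exported file against the ones the project expects. The following is only a sketch: the function names are hypothetical, and the parsing assumes the simple block-list layout that conda env export writes (a top-level channels: key followed by "- name" items), not arbitrary YAML.

```shell
# Print the channels listed in a conda environment YAML file, one per line,
# in file order. Only handles the plain block-list layout (assumption).
list_channels() {
    awk '
        /^channels:/                { in_channels = 1; next }
        in_channels && /^[ \t]*-[ \t]*/ {
            sub(/^[ \t]*-[ \t]*/, "")
            print
            next
        }
        /^[^ \t]/                   { in_channels = 0 }
    ' "$1"
}

# Succeed only if the file lists exactly the expected channels, in order.
# Usage: check_channels <file> <channel>...
check_channels() {
    file=$1
    shift
    expected=$(printf '%s\n' "$@")
    [ "$(list_channels "$file")" = "$expected" ]
}
```

A call like check_channels dev-environment.yml conda-forge could then be run after each conda env export, e.g., from a Makefile target or a pre-commit hook.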

It is also possible to specify a channel for a specific package with <channel>::<package>, which is useful in YAML requirements files. However, to my knowledge, conda env export does not provide a way to pin the channel for each package (which, again, probably doesn't matter much in practice).

Environment-specific config file

With version 4.2, conda has gained support for environment-specific config files (${CONDA_PREFIX}/.condarc). A project could thus ship with a conda config file that contains, e.g., the channels in the correct order (see respective section), or any other configurations that should be the same for all developers (but not necessarily for all projects). As far as I can tell, though, there is no automatic way to install such a .condarc file during environment creation, so the file needs to be copied manually after environment creation:

conda env create -n my-project -f dev-environment.yml
conda activate my-project
cp .condarc ${CONDA_PREFIX}/.condarc

While not that big a deal, this introduces one more step that is easily forgotten. Alternatively, environment creation could be wrapped in a simple script, which would ensure that the .condarc file is always copied -- but this of course would require one more script to be tested and maintained (unless such a script already exists, anyway).
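Such a wrapper could look roughly like the sketch below. It is an assumption-laden illustration: the function name is hypothetical, and it creates the environment at an explicit path with --prefix (instead of -n <name> as above) so that the target directory is known without activating the environment.

```shell
# Hypothetical wrapper: create an environment and install the project's
# .condarc into it in one step.
# Usage: create_env_with_condarc <prefix> <env-file> [<condarc>]
create_env_with_condarc() {
    prefix=$1
    env_file=$2
    condarc=${3:-.condarc}
    # Create the environment at an explicit location ...
    conda env create --prefix "$prefix" --file "$env_file" || return 1
    # ... and copy the project-specific config into it.
    cp "$condarc" "$prefix/.condarc"
}
```

An environment created this way is activated by path, e.g., conda activate ./env after create_env_with_condarc ./env dev-environment.yml.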

Given these complications, I'd conclude that project-specific conda config files are probably not necessary at this point. However, it is worth keeping in mind (i.e., documenting) that project-specific conda configs are possible with little effort, should their benefit ever outweigh the cost of copying the .condarc file after each environment creation.

Unpinned requirements file

Until now, I was under the impression that unpinned environments are defined with TXT files and created with conda create, whereas pinned environments are exported to YAML files and recreated with conda env create. Turns out, though, that both conda and conda env can create environments from unpinned requirements and export pinned environments that can be recreated; they just differ in their file format (TXT and YAML, respectively). So whether to use YAML also for unpinned requirements is a question worth considering, after all.

YAML files provide two big advantages over TXT files:

  • Channels used during installation can be specified (though the channel settings are not actually transferred to the environment, as described in the channels section).
  • Pip packages can be specified in addition to conda packages.

For unpinned requirements, TXT files really have only minor advantages:

  • Their syntax is a little bit simpler.
  • They are also understood by pip and other native Python tools.

The latter is leveraged in the blueprint to read the requirements into setup.py with pkg_utils.parse_requirements, so pip can automatically install the requirements when installing the package (a compromise between specifying the unpinned runtime dependencies in setup.py, as is best practice, and having a requirements.in file to avoid duplication). However, with conda, this is actually pointless, as the requirements are installed with conda before installing the package itself with, e.g., pip install -e . --no-deps (whereby --no-deps has pip ignore the requirements to prevent pip from, e.g., accidentally installing a newer version than available on conda). Instead of passing --no-deps, reading requirements.in into setup.py could just as well be omitted in the first place. This would free us to choose a file format other than the pip-compatible TXT format to specify the requirements.

My conclusions from this:

  • YAML is the better choice for unpinned requirements files, primarily because it allows for the inclusion of pip packages. I'd therefore suggest replacing the {,dev-}requirements.in files with {,dev-}requirements.yml files.
  • Reading requirements into setup.py is unnecessary if the package is always installed in a conda environment. I'd opt for commenting out the code in setup.py that reads requirements.in rather than removing it (alongside a comment explaining the situation), such that a project could be made compatible with pip by just uncommenting those lines.
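To illustrate the first point, here is a sketch of what an unpinned YAML requirements file and the subsequent pinning step could look like. The package names and the channel are placeholders rather than the blueprint's actual dependencies, and create_and_pin is a hypothetical helper.

```shell
# Write a hypothetical unpinned requirements file (placeholder content).
cat > requirements.yml <<'EOF'
channels:
  - conda-forge
dependencies:
  - python>=3.9
  - numpy
  - pip
  - pip:
    - Flask-Testing
EOF

# Create an environment from the unpinned file, then export the pinned result.
# Usage: create_and_pin <env-name>
create_and_pin() {
    conda env create -n "$1" -f requirements.yml || return 1
    conda env export -n "$1" > environment.yml
}
```

The exported environment.yml takes its channels from the active conda config rather than from requirements.yml, so the channel caveats from the channels section apply here as well.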

PS: I stumbled upon pipreqs, which is a tool that scans the source code of a project and creates a requirements.txt file with the direct dependencies based on PyPI. I haven't tried it out, but it sounds like a useful tool to create requirements files for legacy projects that don't yet have one, or to periodically check existing requirements files against the evolving source code. If it works as advertised, it could be included in the default dev requirements and its usage briefly documented.

Pinned environment files

Reproducible environments can be exported with conda env export to YAML files. These contain the whole conda package tree with fixed version numbers, along with any pip-installed packages in a separate section, as well as the conda channels (albeit not necessarily those used to create the environment from a YAML file, see section on channels) and even environment variables set through conda. There are some caveats:

  • While the resulting environment file doesn't contain any platform-specific information, in practice it is generally not cross-platform compatible because some low-level dependencies may differ between platforms, so an environment exported under Linux likely won't be recreatable on macOS or Windows.

  • Because channels are not specified per package (which conda env create actually supports with - channel::package==1.2.3, but conda env export doesn't), a package could in principle be installed from a different channel, e.g., if it were added to a channel with higher precedence between exporting and recreating an environment.

Alternatively, conda list --explicit creates a list of the full dependency tree in the form of direct download links. The resulting plain TXT files, called lock files, are explicitly bound to a specific platform (as the YAML environment files implicitly are, too), and additionally pin the channel as part of the URLs, addressing both caveats listed above. However, the big drawback is that conda list cannot deal with pip-installed packages, which would need to be kept track of separately (e.g., with pip and *requirements.in/*requirements.txt files). (Note that conda list --export is omitted here because the resulting package list offers no advantages over YAML files.)
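Because the platform and the channels are spelled out in such a lock file, they can be inspected with standard text tools. A small sketch with hypothetical helper names, assuming the usual layout of the file (a "# platform: ..." comment line followed by one download URL per package):

```shell
# Print the platform recorded in a file written by `conda list --explicit`.
lockfile_platform() {
    sed -n 's/^# platform: //p' "$1"
}

# Print the distinct channel URLs the packages come from, by stripping the
# trailing "<platform>/<filename>" part from each download link.
lockfile_channels() {
    grep '^http' "$1" | sed 's|/[^/]*/[^/]*$||' | sort -u
}
```

If lockfile_channels spec-file.txt prints more than one line, the environment mixes packages from several channels.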

Personally, I see the advantage of pip support of YAML environment files as trumping the more explicit and specific pinning of the TXT lock files, and would therefore recommend sticking with conda env export to pin environments. That these are implicitly bound to a specific platform should not be a major issue in practice, as most development and all operations are done on Linux, anyway (and *environment.osx.yml files could always be added to a specific project if a developer often works on a different system like a Mac). But the creation of lock files with conda list --explicit should be documented as a fall-back option in case any issues with conda env export should ever arise in a project.
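If a project did add such *environment.osx.yml files, picking the right file could be automated along these lines. This is only a sketch: select_env_file is a hypothetical helper, and the only naming convention assumed is the .osx.yml suffix mentioned above.

```shell
# Pick the pinned environment file for the current system, falling back to
# the default (Linux) file if no macOS-specific one exists.
# Usage: select_env_file <basename> [<kernel-name>]
select_env_file() {
    base=$1
    # The kernel name defaults to the output of `uname -s` ("Darwin" on macOS).
    kernel=${2:-$(uname -s)}
    if [ "$kernel" = "Darwin" ] && [ -f "$base.osx.yml" ]; then
        echo "$base.osx.yml"
    else
        echo "$base.yml"
    fi
}
```

It could be used as conda env create -n my-project -f "$(select_env_file dev-environment)".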

Of note regarding lock files is the tool conda-lock, which allows one to create and update lock files for multiple platforms. It even offers pip support, and thus appears to combine the advantages of YAML environment and TXT lock files. I have not tried it out, but it could be a convenient way of exporting truly reproducible environments incl. pip support.

Thanks @ruestefa! So I would suggest already implementing the following in #32:

  • Move to .yml for unpinned envs.
  • Use spec files for the creation of pinned environments.
  • Make the install script search for pip dependencies listed in requirements.txt.

What do you think?

  • YAML for requirements: Agreed.
  • Spec file for pinned: I would stick with YAML because of the pip support (assuming with "spec files" you mean the TXT lock files). While lock files might provide a bit better reproducibility, those concerns are much more hypothetical than the occasional need to install a package with pip b/c it's missing or outdated in conda.
  • There's no requirements.txt in the blueprint anymore... With "install script" do you mean setup.py?

  • Agreed on YAML
  • Agreed on Spec files
  • No, I mean some of our packages (in particular kenda_python) do have pip dependencies (mostly first-party packages). So far, we captured those in requirements/pip-requirements.in and installed them in the install targets of the Makefile. So my question would be where we capture those now. As far as I understood, we could do this in a file called requirements.txt, or maybe better pip-requirements.txt, but maybe I'm missing an important point here.

Ah, I see. No, there shouldn't be any need for any *requirements.txt files once the requirements are specified in YAML. Pip packages can then be listed like this (copied from the docs):

name: stats2
channels:
  - javascript
dependencies:
  - python=3.9
  - bokeh=2.4.2
  - numpy=1.21.*
  - nodejs=16.13.*
  - flask
  - pip
  - pip:
    - Flask-Testing

At least that's the case for PyPI packages. Unfortunately, conda cannot deal with packages installed directly from GitHub yet (the links to GitHub get lost during conda env export); those still have to be managed separately -- or they could be uploaded to PyPI, conda-forge or some custom/internal conda channel (I have some experience with the latter, i.e., releasing conda packages in a custom channel).
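As long as such packages are managed separately, one option is to keep them in a pip requirements file and install them into the environment after it has been created. The following is a sketch under assumptions: install_pip_extras is a hypothetical helper, the default path is the requirements/pip-requirements.in file mentioned above, and --no-deps assumes the dependencies of those packages are already provided through conda.

```shell
# Hypothetical helper: install pip-only (e.g., GitHub-hosted) packages into
# an existing conda environment.
# Usage: install_pip_extras <env-name> [<requirements-file>]
install_pip_extras() {
    env_name=$1
    req_file=${2:-requirements/pip-requirements.in}
    # Nothing to do if the project has no pip-only dependencies.
    [ -f "$req_file" ] || return 0
    # Run pip inside the environment without activating it in this shell.
    conda run -n "$env_name" python -m pip install --no-deps -r "$req_file"
}
```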

Yes, the last case you mention is exactly the one that is probably the most common for us (APND). So yep, we have to think about and discuss how to handle this. My suggestion would then be to do the remaining work in this PR, open an issue and then have another discussion in the group.

Status:

  • Requirements files have been converted to YAML in #32.
  • A simple bash script to create conda envs has been added in #32.
    • The script could be used to install an environment-specific .condarc (new issue: #45).
  • Opened new issues about pipreqs (#43) and conda-lock (#45) so these tools are not forgotten.

With that, this issue can be closed.