Requirements, environments and conda
Closed this issue · 6 comments
Last week, @cosunae, @regDaniel and I had a brief discussion on how to organize unpinned requirements and pinned environment specifications to produce reproducible conda environments, with the following results:
- Unpinned requirements are currently specified in the files `requirements.in` and `dev-requirements.in`, which are plain text files that can be passed to `conda create -n <name> -f <file>`.
- Conda environments are subsequently exported to the files `environment.yml` and `dev-environment.yml`, which contain additional information such as channels and can be passed to `conda env create -f <file>`.
- In principle, `environment.yml` could be dropped for simplicity, provided installing the dev dependencies during deployment causes negligible overhead. While this would mean one fewer file to maintain and test, we decided to keep both `*environment.yml` files for the time being.
A question that came up was whether to also use YAML instead of plain TXT files to specify the unpinned requirements. Given that our understanding at the time was that YAML is only used for pinned environments, we decided to stick with TXT files.
Since the meeting, I've conducted some more research on these topics, which was very illuminating and cleared up a lot of points that were not totally clear until now.
My main conclusions:
- YAML is, after all, a better choice than plain TXT (`*.in`) files for unpinned requirements, too (channel specification, pip support).
- Channel specification is a mess, and using the right channels in the right order for a given project requires some care. (If everyone primarily uses `conda-forge`, though, this should not be a big issue in practice.)
- Conda is not totally unaware of pip-installed packages, after all, and with some caution, working with a mix of conda and pip packages should be possible.
- There is no silver bullet when it comes to truly reproducible conda environments. (Whether this actually matters in practice, given the small number of systems that need to be supported for deployment, is another question.)
Following is a more detailed write-up on various topics. (It's become way longer than anticipated because I used it to properly sort this all out in my head.)
`conda` vs. `conda-env`

Conda environments can be created with either `conda create` or `conda env create`. The main difference between the two is that `conda create` works with TXT files, while `conda env create` works with YAML files. From what I understand, `conda-env` started out separate from conda but was integrated as `conda env` at some point, which is why they are similar yet subtly different (a frequent cause of confusion).
`conda` can export environments as TXT with `conda list --export`, which outputs all installed packages with pinned version number and build identifier. Furthermore, `conda list --explicit` outputs the links to all packages, which is the best way to recreate a truly identical environment, albeit only for a given platform (link) and without pip support. To recreate an environment from a TXT file, use `conda create -n <name> -f <file>`.
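For illustration, a `conda list --export` file is just a plain list of pinned `name=version=build` entries (the package versions and build strings below are made up):

```text
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
numpy=1.21.2=py39h_0
pandas=1.3.3=py39h_1
```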
`conda env` can export environments as YAML with `conda env export`, which outputs the same list of packages as `conda list --export`, but the YAML format allows for the inclusion of additional information like the environment name and the channels (albeit not necessarily those used to create the environment from a YAML file, as described in the channels section). To recreate an environment from a YAML file, use `conda env create -n <name> -f <file>`.
A major difference between the two commands/formats is that the YAML files also contain pip-installed packages (and improved interoperability with pip is an area of active development), whereas those are missing from the TXT files (click).
Note that there is an ongoing effort to unify environment specifications between the `conda` and `conda env` (as well as `anaconda-project`) commands.
A thorough approach combining both commands and `pip freeze` is described here:

```bash
conda env export > environment.yml
conda list --explicit > spec-file.txt
pip freeze > requirements.txt

conda create --name NEWENV --file spec-file.txt
conda env update --name NEWENV --file environment.yml
pip install -r requirements.txt
```
Whether this is actually necessary in an environment with few architectures/machines is another question, but it is at least worth noting here. (Also, it is not entirely clear to me why the pip commands are necessary, given `environment.yml` should already contain the pip-installed packages.)
Channels
Conda packages are organized in channels. There is a `defaults` channel, but in practice, the de-facto standard channel is `conda-forge` (essentially the conda equivalent of PyPI, containing community-provided packages). Channels shouldn't be mixed, so `conda-forge` should take precedence over `defaults` (or the latter deactivated altogether with `nodefaults`).
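As a minimal sketch, a `~/.condarc` that gives `conda-forge` precedence over `defaults` (and enforces strict channel priority, which is generally recommended when mixing channels is to be avoided) could look like this:

```yaml
channels:
  - conda-forge
  - defaults
channel_priority: strict
```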
By default, channels are managed globally, usually specified in a user's `~/.condarc`. The channels to be used during environment creation with `conda env create -f <file>` (provided it works) can also be specified in `*environment.yml` files. However, this channel specification is not transferred into the environment, so subsequent operations like `conda install` or (notably) `conda env export` fall back to the user's global channel specification. Therefore, if the channels in a `*.yml` file differ from a user's global config, care must be taken when creating an updated `*.yml` file in order to maintain the correct channels.
It is possible, though, to change the conda config for a specific environment, e.g., set the channels with `conda config --env --add channels nodefaults --add channels conda-forge` (click, click, click). This writes the config to the environment-specific file `${CONDA_PREFIX}/.condarc`, which takes precedence over the user's `~/.condarc` when the environment is active. (If `~/.condarc` contains channels not in `${CONDA_PREFIX}/.condarc`, however, those will still appear after the ones specified in the latter file; see also here.)
If everyone primarily uses `conda-forge`, this shouldn't matter much in practice. However, if someone uses additional channels (like our group channel), or different projects use different channels (or the same channels in a different order), care must be taken during `conda env export` that the channels used for a project are not accidentally changed.
It is also possible to specify a channel for a specific package with `<channel>::<package>` (click, click), which is useful in YAML requirements files. However, to my knowledge, `conda env export` does not provide a way to pin the channel for each package (which, again, probably doesn't matter much in practice).
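In a YAML requirements file, the per-package channel syntax looks like this (the package choices are illustrative):

```yaml
dependencies:
  - conda-forge::numpy
  - bioconda::samtools=1.13
```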
Environment-specific config file
With version 4.2, conda gained support for environment-specific config files (`${CONDA_PREFIX}/.condarc`). A project could thus ship with a conda config file that contains, e.g., the channels in the correct order (see respective section), or any other configurations that should be the same for all developers (but not necessarily for all projects). As far as I can tell, though, there is no automatic way to install such a `.condarc` file during environment creation, so the file needs to be copied manually after environment creation:
```bash
conda env create -n my-project -f dev-environment.yml
conda activate my-project
cp .condarc ${CONDA_PREFIX}/.condarc
```
While not that big a deal, this introduces one more step that is easily forgotten. Alternatively, environment creation could be wrapped in a simple script, which would ensure that the `.condarc` file is always copied -- but this of course would require one more script to be tested and maintained (unless such a script already exists, anyway).
Given these complications, I'd conclude that project-specific conda config files are probably not necessary at this point. However, it is worth keeping in mind (i.e., documenting) that project-specific conda configs are possible with little effort, should their benefit ever outweigh the cost of copying the `.condarc` file after each environment creation.
Unpinned requirements file
Until now, I was under the impression that unpinned environments are defined with TXT files and created with `conda create`, whereas pinned environments are exported to YAML files and recreated with `conda env create`. Turns out, though, that both `conda` and `conda env` can create environments from unpinned requirements and export pinned environments that can be recreated; they just differ in their file format (TXT and YAML, respectively). So whether to use YAML also for unpinned requirements (click) is a question worth considering, after all.
YAML files provide two big advantages over TXT files:
- Channels used during installation can be specified (though the channel settings are not actually transferred to the environment, as described in the channels section).
- Pip packages can be specified in addition to conda packages.
For unpinned requirements, TXT files offer really only minor advantages:
- Their syntax is a little bit simpler.
- They are also understood by pip and other native Python tools.
The latter is leveraged in the blueprint to read the requirements into `setup.py` with `pkg_utils.parse_requirements`, so pip can automatically install the requirements when installing the package (a compromise between specifying the unpinned runtime dependencies in `setup.py`, as is best practice, and having a `requirements.in` file to avoid duplication). However, with conda, this is actually pointless, as the requirements are installed with conda before installing the package itself with, e.g., `pip install -e . --no-deps` (whereby `--no-deps` has pip ignore the requirements, to prevent pip from, e.g., accidentally installing a newer version than available on conda). Instead of passing `--no-deps`, reading `requirements.in` into `setup.py` could just as well be omitted in the first place. This would free us to choose a different file format to specify the requirements than the pip-compatible TXT format.
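For reference, the kind of logic that would be dropped (or commented out) is roughly the following -- a minimal sketch, not the blueprint's actual code; `read_requirements` is a hypothetical helper:

```python
from pathlib import Path

def read_requirements(path="requirements.in"):
    """Return the non-empty, non-comment lines of a pip-style requirements file."""
    lines = Path(path).read_text().splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith("#")]

# In setup.py, the corresponding call could be commented out once conda
# manages the dependencies:
# setup(
#     ...,
#     install_requires=read_requirements("requirements.in"),
# )
```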
My conclusions from this:
- YAML is the better choice for unpinned requirements files, primarily because it allows for the inclusion of pip packages. I'd therefore suggest replacing the `{,dev-}requirements.in` files by `{,dev-}requirements.yml` files.
- Reading requirements into `setup.py` is unnecessary if the package is always installed in a conda environment. I'd opt for commenting out the code in `setup.py` that reads `requirements.in` rather than removing it (alongside a comment explaining the situation), such that a project could be made compatible with pip by just uncommenting those lines.
PS: I stumbled over pipreqs, a tool that scans the source code of a project and creates a `requirements.txt` file with the direct dependencies based on PyPI. I haven't tried it out, but it sounds like a useful tool to create requirements files for legacy projects that don't yet have one, or to periodically check existing requirements files against the evolving source code. If it works as advertised, it could be included in the default dev requirements and its usage briefly documented.
Pinned environment files
Reproducible environments can be exported with `conda env export` to YAML files (click). These contain the whole conda package tree with fixed version numbers, along with any pip-installed packages in a separate section, as well as the conda channels (albeit not necessarily those used to create the environment from a YAML file, see section on channels) and even environment variables set through conda (click). There are some caveats:
- While the resulting environment file doesn't contain any platform-specific information, in practice it is generally not cross-platform compatible because some low-level dependencies may differ between platforms, so an environment exported under Linux likely won't be recreatable on Mac OS or Windows (e.g., click, click).
- Because channels are not specified per package (which `conda env create` actually supports with `- channel::package==1.2.3`, but `conda env export` doesn't), a package could in principle be installed from a different channel, e.g., if it were added to a channel with higher precedence between exporting and recreating an environment.
Alternatively, `conda list --explicit` creates a list of the full dependency tree in the form of direct download links. The resulting plain TXT files, called lock files, are explicitly bound to a specific platform (as the YAML environment files implicitly are, too), and additionally pin the channel as part of the URLs, addressing both caveats listed above. However, the big drawback is that `conda list` cannot deal with pip-installed packages, which would need to be kept track of separately (e.g., with `pip` and `*requirements.in`/`*requirements.txt` files). (Note that `conda list --export` is omitted here because the resulting package list offers no advantages over YAML files.)
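For illustration, such a lock file is just a platform header, an `@EXPLICIT` marker, and one download URL per package (the URLs below are shortened and illustrative):

```text
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/python-3.9.7-hb7a2778_1.tar.bz2
https://conda.anaconda.org/conda-forge/linux-64/numpy-1.21.2-py39h_0.tar.bz2
```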
Personally, I see the pip support of YAML environment files as trumping the more explicit and specific pinning of the TXT lock files, and would therefore recommend sticking with `conda env export` to pin environments. That these are implicitly bound to a specific platform should not be a major issue in practice, as most development and all operations are done on Linux, anyway (and `*environment.osx.yml` files could always be added to a specific project if a developer often works on a different system like a Mac). But the creation of lock files with `conda list --explicit` should be documented as a fall-back option in case any issues with `conda env export` should ever arise in a project.
Of note regarding lock files is the tool conda-lock (click), which allows one to create and update lock files for multiple platforms. It even offers pip support, and thus appears to combine the advantages of YAML environment and TXT lock files. I have not tried it out, but it could be a convenient way of exporting truly reproducible environments including pip support.
What do you think?
- YAML for requirements: Agreed.
- Spec file for pinned: I would stick with YAML because of the pip support (assuming with "spec files" you mean the TXT lock files). While lock files might provide a bit better reproducibility, those concerns are much more hypothetical than the occasional need to install a package with pip b/c it's missing or outdated in conda.
- There's no `requirements.txt` in the blueprint anymore... With "install script" do you mean `setup.py`?
- Agreed on YAML
- Agreed on Spec files
- No, I mean some of our packages (i.p. `kenda_python`) do have pip dependencies (in particular first-party packages). So far we captured those in `requirements/pip-requirements.in` and installed them at the `install` targets in the `Makefile`. So my question would be where we capture those, and as far as I understood, we could do this in a file called `requirements.txt` or maybe better `pip-requirements.txt`, but maybe I'm missing an important point here.
Ah, I see; no, there shouldn't be any need for any `*requirements.txt` files once the requirements are specified in YAML. Then it should be possible like this (copied from the docs):

```yaml
name: stats2
channels:
  - javascript
dependencies:
  - python=3.9
  - bokeh=2.4.2
  - numpy=1.21.*
  - nodejs=16.13.*
  - flask
  - pip
  - pip:
    - Flask-Testing
```
At least that's the case for PyPI packages; unfortunately, conda cannot deal with packages installed directly from GitHub yet (the links to GitHub get lost during `conda env export`), so those still have to be managed separately -- or they could be uploaded to PyPI, conda-forge, or some custom/internal conda channel (I have some experience with the latter, i.e., releasing conda packages in a custom channel).
Yes, the last case you mention is exactly the one that is probably the most common for us (APND). So yep, we have to think about/discuss how to handle this. Then my suggestion would be to do the remaining work in this PR, open an issue, and then have another discussion in the group.
Status:
- Requirements files have been converted to YAML in #32.
- A simple bash script to create conda envs has been added in #32.
  - The script could be used to install an environment-specific `.condarc` (new issue: #45).
- Opened new issues about `pipreqs` (#43) and `conda-lock` (#45) so these tools are not forgotten.
With that, this issue can be closed.