maize-genetics/phg_v2

[REQUEST]: allow non-conda installation of dependencies

Closed this issue · 10 comments

Description

I have been trying to get phg to work on my system as a Singularity image, but I have run into an issue: the code assumes that the user always has a conda environment containing the dependencies. Since a Singularity image (or Docker container, for that matter) can have all dependencies inside it without needing a conda environment, would it be possible to have a --no-conda option or something similar that assumes all dependencies are available on the ${PATH}?

For testing, I have written this singularity .def file (for version 2.3):

Bootstrap: docker
From: mambaorg/micromamba:1.5.8

%post
   apt-get update && apt-get install -y wget

%post
   micromamba install -y -n base -c conda-forge -c bioconda -c tiledb python=3.8.15 tiledb-py=0.22.3 tiledbvcf-py=0.25.3 anchorwave=1.2.2 bcftools=1.16 samtools=1.16.1 agc=3.0 openjdk=17.0.10
   micromamba clean --all --yes

%post
   mkdir -p /opt
   cd /opt
   wget https://github.com/maize-genetics/phg_v2/releases/download/2.3.16.153/PHGv2-v2.3.tar
   tar xvf PHGv2-v2.3.tar
   rm PHGv2-v2.3.tar

%environment
   export PATH=/opt/phg/bin:$PATH
   export JAVA_OPTS="-Xmx50g"

Alternatives

No response

Additional Context

No response

This is something we can consider. Is there a reason you prefer docker vs conda?

I prefer having all in one single (singularity) container so it's easy to incorporate in my pipelines. I prefer to run most tools in a Snakemake pipeline myself so it's reproducible later on. I think removing the need for having a specific conda environment name could solve this :)

I tried to implement it myself in a PR, but I cannot get the tests to run successfully (not even without any changes). There seems to be an assumption that I have a full TileDB available at $HOME/temp/phgv2Tests/tempDir/testTileDBURI/, which I don't.

If you prefer to implement it yourself, no worries! Just thought I could give it a go!

We have created a card to consider this request. If implemented, it may not be via a parameter, but based on other internal changes to the code. One of our goals is to keep parameters to a minimum. We find an abundance of parameters results in an interface that is confusing to users. At the moment we have higher priorities so I cannot predict when we will address it.

In the meantime, one option for you is to take our phg_environment.yml file and create a "phgv2-conda" conda environment inside your Docker container. You would not need to activate the environment, just create it. If you decide to try this, please let us know how it works. We appreciate your feedback!

The command to run inside your container would be:
conda env create --solver=libmamba --file src/main/resources/phg_environment.yml

(replace "src/main/resources/phg_environment.yml" with the path to your copy of the phg_environment.yml file)

After playing around with your suggestion and some other ideas, I now have a working version. It creates a script called conda that checks whether phg is trying to run something in the default environment (phgv2-conda) and, if so, strips the conda run -n phgv2-conda prefix from the command and runs the rest through micromamba.

The singularity .def file (works with singularity v3.9; haven't tested other versions):

Bootstrap: docker
From: mambaorg/micromamba:1.5.8

%post
   apt-get update && apt-get install -y wget

%post
   mkdir -p /opt
   cd /opt
   wget https://github.com/maize-genetics/phg_v2/releases/download/2.3.16.153/PHGv2-v2.3.tar
   tar xvf PHGv2-v2.3.tar
   rm PHGv2-v2.3.tar

%post
   micromamba install -y -n base -c conda-forge -c bioconda -c tiledb python=3.8.15 tiledb-py=0.22.3 tiledbvcf-py=0.25.3 anchorwave=1.2.2 bcftools=1.16 samtools=1.16.1 agc=3.0 openjdk=17.0.10
   micromamba clean --all --yes

%post
    cat << 'EOF' > /usr/local/bin/conda
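#!/bin/bash
# Shim for "conda": PHGv2 calls "conda run -n phgv2-conda <cmd>"; strip that
# prefix and run <cmd> in micromamba's base environment instead.
if [[ "$1" == "run" && ("$2" == "-n" || "$2" == "--name") && "$3" == "phgv2-conda" ]]; then
    shift 3
    exec micromamba run --name base "$@"
else
    echo "conda is not installed; use micromamba instead" >&2
    exit 1
fi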
EOF
    chmod +x /usr/local/bin/conda

%environment
    export PATH=/usr/local/bin:/opt/phg/bin:/opt/conda/bin:$PATH
    export JAVA_OPTS="-Xmx50g"

%runscript
    echo "Running: $*"
    exec "$@"
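
For anyone who wants to check the shim's argument handling without building the image, here is a minimal dry-run sketch of the same logic. It prints the command instead of exec'ing it, so micromamba does not need to be installed. The function name rewrite_conda_call and the TARGET_ENV variable are my own illustrative names, not part of PHGv2.

```shell
#!/bin/bash
# Dry-run sketch of the conda shim above: prints the rewritten command
# instead of running it. TARGET_ENV and rewrite_conda_call are
# illustrative names, not something PHGv2 defines.
TARGET_ENV="phgv2-conda"

rewrite_conda_call() {
    if [[ "$1" == "run" && ( "$2" == "-n" || "$2" == "--name" ) && "$3" == "$TARGET_ENV" ]]; then
        # Drop "run -n <env>" and keep the actual tool invocation.
        shift 3
        echo "micromamba run --name base $*"
    else
        # Any other conda subcommand is unsupported by the shim.
        echo "conda is not installed; use micromamba instead" >&2
        return 1
    fi
}

rewrite_conda_call run -n phgv2-conda bcftools --version
# prints: micromamba run --name base bcftools --version
```

Because the dry run joins arguments with spaces, it is only meant for eyeballing the rewrite; the real shim keeps "$@" so arguments containing spaces survive intact.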

You may close this issue as this solves it for me and you indicated such a workaround is preferred for now.

I'm glad you found a solution that works for you. Keep in mind that the programs phgv2 requires, which you load from conda, must have version tags that match the phgv2 release you are pulling. Otherwise there will be errors in execution.
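
One way to guard against such a mismatch is a small check that compares the pinned versions against what is actually installed. The sketch below is only an illustration: check_pins is a hypothetical helper, both lists are plain space-separated name=version strings, and how the installed list is produced (for example from the package manager's listing) is left open. The expected pins would come from the phg_environment.yml of the release in question.

```shell
#!/bin/bash
# check_pins EXPECTED INSTALLED
# Hypothetical helper: both arguments are space-separated "name=version"
# lists. Prints one line per missing or mismatched package and returns 1
# if there is any problem, 0 otherwise.
check_pins() {
    local expected="$1" installed="$2" status=0 pin item name found
    for pin in $expected; do
        name="${pin%%=*}"
        found=""
        # Look for an installed entry with the same package name.
        for item in $installed; do
            if [[ "${item%%=*}" == "$name" ]]; then
                found="$item"
                break
            fi
        done
        if [[ -z "$found" ]]; then
            echo "missing: $pin"
            status=1
        elif [[ "$found" != "$pin" ]]; then
            echo "mismatch: expected $pin, found $found"
            status=1
        fi
    done
    return "$status"
}

# Example with made-up installed versions.
check_pins "bcftools=1.16 samtools=1.16.1" "bcftools=1.16 samtools=1.17" \
    || echo "version check failed"
```

The comparison is an exact string match on name=version, so a build suffix or a differently formatted version string would be reported as a mismatch.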

Yes, I will! That's also why I have the version of phgv2 hardcoded. But as far as I'm aware, the YAML file is not part of the GitHub release files? If it were, it would be easier to adapt this to another version, but for now I'll check the dependencies per version :)

Correct, the yml file is not part of the release as an individual file. To access its contents, you would need to do this programmatically with a getResource("phg_environment.yml") call against the Java class. If you think this would be useful, we could consider putting phg_environment.yml in the phg/resources/main folder with the application.conf file.

For me it is not needed since I know where to look, but should you decide to add a docker and/or singularity definition file to your repo it would definitely make it more future- and fool-proof I think.

Closing this issue as the user has a workaround in place.