Add information on recommended method of installing Bioconductor packages
pmoris opened this issue · 18 comments
I'm having some difficulties modifying the rocker tidyverse base image with Bioconductor packages. I've written up my goal, approach and problems more extensively in an issue on littler's github page (eddelbuettel/littler#93), because I thought there was something strange going on with the `--repository` flag of the `install2.r` script, although that turned out to be a more low-level issue that can happen when mixing repos and thus has nothing to do with `littler` itself.
I'm posting this issue here however because I hope that the rocker community can provide some guidance on how to tackle the things I'd like to do, i.e. install bioconductor in (rocker) docker and have the build fail when something goes wrong. I believe that this information could be useful for other users and could be included on the Rocker Project's guide on extending the images.
Very briefly, here are my findings and struggles:
- Neither of the two standard approaches for installing Bioconductor packages that I've found (`R -e 'BiocManager::install("package")'` and `/usr/local/lib/R/site-library/littler/examples/installBioc.r`) raises a non-zero exit code, so neither causes Docker builds to fail when something goes wrong (e.g. a package that is unavailable in the repo, a misspelled name, or missing dependencies).
- You can pass specific repos to the `install2.r` script (which does raise this error), but you have to pass all the specific `BioCsoft`, `BioCann`, `BioCexp` URLs explicitly, as well as the default CRAN repo (because when any `-r` flag is added, `install2.r` seemingly forgets about the standard CRAN repo defined by `options("repos")`). However, as I've shown in the littler issue I linked to above, the order in which these repos are given seems to affect the outcome.
- Bioconductor's instructions on modifying their own Dockerfiles (which in turn are based on Rocker) do not offer any advice on this aspect either (https://bioconductor.org/help/docker/#modifying-the-images).
- The tidyverse rocker images include Bioconductor, so I assume that it is intended to be used:
  - Rocker images set `options("repos")` to frozen RStudio Package Manager URLs (like `"https://packagemanager.rstudio.com/all/__linux__/focal/latest"` for the most recent release and `"https://packagemanager.rstudio.com/cran/__linux__/focal/296"` for version 4.0.1).
  - Rocker images do not set `options("BioC_mirror")`, so running `BiocManager::repositories()` shows the default bioconductor repositories, which are tied to the specific version of BiocManager that is installed:

        BioCsoft      "https://bioconductor.org/packages/3.12/bioc"
        BioCann       "https://bioconductor.org/packages/3.12/data/annotation"
        BioCexp       "https://bioconductor.org/packages/3.12/data/experiment"
        BioCworkflows "https://bioconductor.org/packages/3.12/workflows"
        CRAN          "https://packagemanager.rstudio.com/cran/__linux__/focal/296"

  - The RStudio Package Manager repository for bioconductor cannot be frozen to a specific time point. Instead it is recommended to set a standard URL via `options(BioC_mirror = "https://packagemanager.rstudio.com/bioconductor")` and to use a compatible CRAN snapshot (they list appropriate snapshots for given versions of bioconductor here: https://packagemanager.rstudio.com/client/#/repos/4/overview).
- Rocker relies on the `install2.r` script for installing most of its packages. This is good, because unlike other methods such as `RUN R -e "install.packages('tidyverse')"`, `install2.r` returns a non-zero exit code to the shell when it fails, which stops docker builds. Otherwise, the build would continue just fine and you would end up with a Docker image that is missing your package, without any way of knowing (except for scrolling through the very long and verbose output of the R install process or trying to load the package while running the container).
  - This raises a side question: rocker also sets a specific download method in `options(repos)` in Rprofile.site (rocker-versioned2/scripts/install_R.sh, line 135 in 6d5eed8). Is that applied by `install2.r` automatically?
I believe that this issue is not tied to which specific repositories are being used (RSPM or the default bioconductor ones) and that it could be worthwhile to highlight it somewhere in rocker's guide on modifying and extending the images. E.g. warning users that Bioconductor installs done by calling `BiocManager::install()` can fail silently, and advising them to verify that using `install2.r` with all the individual sub-repositories for bioconductor does what they intend it to do.
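For readers who want to try the `install2.r` route, the repository flags can be assembled mechanically. This is a minimal sketch, assuming the URL layout printed by `BiocManager::repositories()` in the list above. The function name `bioc_repo_flags` is hypothetical. Because the order of the repositories seemed to affect the outcome, the order used here (Bioconductor first, CRAN last) is just one possible choice, not a recommendation.

```shell
#!/bin/sh
# Hypothetical helper (not part of littler or rocker): print "-r <url>" flags
# for install2.r covering the Bioconductor software, annotation and
# experiment repositories plus one CRAN repository.
#   $1 = Bioconductor version (e.g. 3.12)
#   $2 = CRAN repository URL
bioc_repo_flags() {
    bioc_version="$1"
    cran_url="$2"
    # URL layout assumed from the BiocManager::repositories() output above.
    base="https://bioconductor.org/packages/${bioc_version}"
    printf '%s\n' "-r ${base}/bioc -r ${base}/data/annotation -r ${base}/data/experiment -r ${cran_url}"
}
```

It might be used as `install2.r --error $(bioc_repo_flags 3.12 "$CRAN") somepackage`, where `$CRAN` is assumed to hold the CRAN URL configured in the image. Whether the result matches what `BiocManager::install()` would install still has to be checked by hand.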
Am I going about this the wrong way perhaps? I guess I can just forget about pinning a specific repository and just keep track of my images as the unit I need to store for reproducibility? But then again, rocker images for previous versions of R also pin repo URLs, so that seems to be the intended approach. Any other advice or insight into how I can better handle these installations is highly appreciated!
EDIT: I've cross-posted this to the Bioconductor repository as well, since the same kind of addition to their documentation would be useful imo: Bioconductor/bioconductor_docker#38
Thanks, this mostly sounds accurate. The key distinction here is that, in my understanding, BioC packages are already frozen to the annual R version, much like Ubuntu and other Linux distros do with their default repositories. I believe the bioc installer selects the appropriate repository based on the R version, so the rocker-versioned approach here is basically to leave well enough alone. As we already freeze the R version, the corresponding BioC repo should be determined from that.
Let me know if that makes sense or if I'm missing something. (I'm not an active user of many BioC packages, so I could easily be missing something in my understanding here!)
Agree 💯 that we ought to improve the docs about this in any case
> i.e. install bioconductor in (rocker) docker and have the build fail when something goes wrong.
How about modifying the `installBioc.r` script to be like the `install2.r` script in order to make the build fail when the installation fails? (See rocker-versioned2/scripts/bin/install2.r, lines 81 to 84 in 889a33b.)
Thanks both for your replies!
> How about modifying the installBioc.r script to be like the install2.r script in order to make the build fail when the installation fails?
If that is possible, I'd be stoked! It would solve the major problem I'm facing and also make the behaviour of the script more consistent with not just `install2.r`, but also `installGithub.r`!
> The key distinction here is that, in my understanding, BioC packages are already frozen to the annual R version, much like Ubuntu and other Linux distros do with their default repositories. I believe the bioc installer selects the appropriate repository based on the R version, so the rocker-versioned approach here is basically to leave well enough alone. As we already freeze the R version, the corresponding BioC repo should be determined from that.
That does indeed make sense! I'm quite new to bioconductor myself (or rather, I've never had the need to delve into the way it manages packages), so here's what I've gathered just now:
- Each Bioconductor release is designed to work with a specific version of R. https://bioconductor.org/about/release-announcements/
- Bioconductor has a repository and release schedule that differs from R (Bioconductor has a ‘devel’ branch to which new packages and updates are introduced, and a stable ‘release’ branch emitted once every 6 months to which bug fixes but not new features are introduced). .... The install() function is provided by BiocManager. This is a wrapper around install.packages, but with the repository chosen according to the version of Bioconductor in use, rather than to the version relevant at the time of the release of R. https://www.bioconductor.org/install/#why-biocmanagerinstall
- But how does a CRAN package know what version of Bioconductor is in use? Can we use BiocManager? No, because we don’t have enough control over the version of BiocManager available on CRAN, e.g., everyone using the same version of R would get the same version of BiocManager and hence of Bioconductor. But there are two Bioconductor versions per R version, so that does not work! .... Is there any other way that R could keep track of version information? Yes, by installing a Bioconductor package (BiocVersion) whose sole purpose is to indicate the version of Bioconductor in use. https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html (under the "How it works" section).
In any case, from what I can tell, BiocManager (and BiocVersion) seem to work just fine regardless of whether the bioconductor or the RSPM repository is being used. I.e., users can install a desired version of bioconductor (and will be warned when they try to use a version that is incompatible with the available version of R), and the different repository URLs (BioCSoft, BioCAnn, etc.) will be adjusted automatically (using the repository URL prefix that is set by `options("BioC_mirror")`).
So all of that seems to work as intended and I agree with your "leave well enough alone" assessment ;) Apologies for writing out this wall of text, but at the very least it helped me get a better grip on things.
Since these specific peculiarities are pretty much unique to bioconductor, I understand that it's a bit difficult to gauge how much of it needs to be documented by the rocker project as opposed to by bioconductor though... Perhaps a note that BiocManager is installed in the tidyverse image but the default repository is retained, alongside a warning on how best to install BioC packages, could be worthwhile additions?
The README currently states:
> Please install R packages from source using the install.packages() R function or the install2.r script, and use apt only to install necessary system libraries (e.g. libxml2). Do not use apt install r-cran-* to install R packages.
It would be helpful to add info here on Bioconductor package installation (e.g., `installBioc.r`).
It would also be helpful to include information on how to install Bioconductor packages when `installBioc.r` is not in the PATH for the rocker image (e.g., `r-ver:4.2.1`).
Thanks @nick-youngblut ! PR's always welcome, we're a community-driven project.
> PR's always welcome, we're a community-driven project.
I can see why you'd like help, given how much of a pain writing documentation can be, but asking for help with documentation from those that are currently looking for the documentation seems like it will lead to documentation edits that do not incorporate best-practices, as defined by the software developers. For instance, I'm currently trying the following:
RUN install2.r --ncpus 2 --error \
argparse ape dplyr tidyr BiocManager && \
R -e 'BiocManager::install("sangeranalyseR")' && \
rm -rf /tmp/downloaded_packages
...but I don't know if it will work (the build is still running) or if it follows best-practices. If it does work, I can create a PR with an updated README, but I'm guessing the person(s) reviewing the PR will just have to heavily edit the changes.
Hey @nick-youngblut , thanks! yup, a PR is a great way for us as a community to discuss these things! This is not just because I am too lazy to update the readme, but because that discussion process of issues and PRs usually gets us to a better point that meets the needs of other users, and is also easier for other developers and community members to chime in.
I agree with you that `installBioc.r` is probably the best choice for most users, and we should probably start by documenting that more clearly!
Like you note, that's not so helpful since unlike `install2.r` or `installGithub.r`, it's not sym-linked onto the default PATH. These helper utilities are part of `littler`, so it's available in `$R_HOME/site-library/littler/examples/installBioc.r` -- and we should probably symlink it in https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/setup_R.sh#L78 I think.
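Until such a symlink ships in the image, users can add it themselves. This is a minimal sketch, assuming the littler examples directory mentioned above and a writable bin directory on the PATH. The function name `link_littler_script` is hypothetical, and `/usr/local/bin` in the usage line is an assumption about where the other helpers are linked.

```shell
#!/bin/sh
# Hypothetical helper (not part of littler or rocker): symlink one of the
# littler example scripts into a directory that is on the PATH.
#   $1 = script name (e.g. installBioc.r)
#   $2 = littler examples directory
#   $3 = target bin directory
link_littler_script() {
    script="$1"
    examples_dir="$2"
    bin_dir="$3"
    # Refuse to create a dangling link if the script is not where we expect.
    if [ ! -f "${examples_dir}/${script}" ]; then
        echo "not found: ${examples_dir}/${script}" >&2
        return 1
    fi
    ln -sf "${examples_dir}/${script}" "${bin_dir}/${script}"
}
```

For example: `link_littler_script installBioc.r "${R_HOME}/site-library/littler/examples" /usr/local/bin`.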
So I was intrigued to see how far r2u could help, given its partial Bioconductor support (and of course famously complete CRAN support). I fired up the `eddelbuettel/r2u:jammy` container (to be ported to Rocker "soon") and did
# first command an echo of yours, installs in a few (single) seconds
install.r argparse ape dplyr tidyr BiocManager
# then I tried this, which came back with a loooong list of packages, so I Ctrl-C'ed out
#Rscript -e 'bspm::disable(); BiocManager::install("sangeranalyseR")'
# instead this installed all available build-deps
# (I had edited the '' and , out of the return from the stopped attempt)
install.r sys bitops bit colorspace askpass zlibbioc RCurl GenomeInfoDbData bit64 blob memoise plogr isoband farver labeling munsell curl openssl BH fs rappdirs pixmap sp RcppArmadillo BiocGenerics S4Vectors IRanges XVector GenomeInfoDb crayon RSQLite DBI plyr fastmatch igraph quadprog gtable httpuv mime xtable fontawesome htmltools sourcetools later promises fastmap commonmark bslib cachem ellipsis ggplot2 scales httr viridisLite base64enc htmlwidgets RColorBrewer lazyeval crosstalk jquerylib anytime sass zip evaluate tinytex xfun yaml highr ade4 segmented bookdown Biostrings DECIPHER reshape2 phangorn sangerseqR gridExtra shiny shinydashboard shinyjs data.table plotly DT zeallot excelR shinycssloaders ggdendro shinyWidgets openxlsx rmarkdown knitr BiocStyle logger
# then I could just do -- which was quick
Rscript -e 'bspm::disable(); BiocManager::install("sangeranalyseR")'
Now all is good:
> library(sangeranalyseR)
Loading required package: stringr
Loading required package: ape
Loading required package: Biostrings
Loading required package: BiocGenerics
[.... lots and lots omitted ...]
Loading required package: logger
Welcome to sangeranalyseR
>
It uses current packages, not the 'versioned' stack, so it may not be of interest to you. But we can get a lot of BioC quickly installed, which is still of interest to some.
Apparently, my attempt above does not work. I was able to install the `sangeranalyseR` package via `R -e 'BiocManager::install("sangeranalyseR")'`, and the docker image build completed successfully. However, when I try to load the R package in my R script within the image, I get the error:
Error in library("sangeranalyseR") :
there is no package called ‘sangeranalyseR’
...so it appears that the bioconductor package is not installed in the correct libPath. My libPaths when calling the R script:
"/usr/local/lib/R/site-library"
"/usr/local/lib/R/library"
I cannot find the "installed" sangeranalyseR package anywhere in the docker image. The following returns nothing:
find / -iname "sangeranalyseR" 2> /dev/null
...and the package is definitely not in `/usr/local/lib/R/site-library/`.
The entire docker file that I'm using:
FROM ubuntu:20.04
FROM rocker/r-ver:4.2.1
# Install OS dependencies
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y \
build-essential
# Install R dependencies
RUN install2.r --ncpus 2 --error \
argparse ape dplyr tidyr ggplot2 purrr furrr data.table tidytable BiocManager && \
R -e 'BiocManager::install("sangeranalyseR")' && \
rm -rf /tmp/downloaded_packages
# CMD
CMD ["/bin/bash", "-c", "R --version"]
🤷♂️
What I showed you was real. I just used `eddelbuettel/r2u:jammy` as the base. It does not have those `.libPaths()`. If you are in a different environment you need to debug what is different.
(I also tried to throw a quick demo Dockerfile together (just as I had already done once today), but that balked, as @Enchufa2 and I currently have an issue with `bspm` where it does not fall back as smoothly when some packages are not in the repo. Your implied laundry list of packages is really long. It worked interactively; in building a Dockerfile it balked. Sorry. r2u is real though: I encourage you to play a little. We have 20k CRAN packages, and about 240 BioC. So you can go a long way.)
Well, sure, if you use `rocker/r-ver` then none of this applies. I tried to say so in my first message.
@nick-youngblut I suspect your installation isn't succeeding due to missing system libraries (might be `apt-get install -y zlib1g-dev libxml2-dev libglpk-dev`) that you'll need to list in your Dockerfile. (r2u does this magically 🪄, but r-ver does not. You could use a more downstream member of the r-ver family that includes more of these dependencies by default, though.)
Recall that R does not throw an error when `install.packages()` fails. (Note that, like `install2.r`, `installBioc.r` provides the `--error` flag to alter this behavior, which is imperfect but usually best in Dockerfiles.)
For instance, this Dockerfile works for me: (though it does take 330 seconds to build)
FROM rocker/verse
# Install R dependencies
RUN install2.r --ncpus 2 --error \
argparse ape dplyr tidyr ggplot2 purrr furrr data.table tidytable BiocManager && \
$R_HOME/site-library/littler/examples/installBioc.r --error sangeranalyseR && \
rm -rf /tmp/downloaded_packages
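Since R itself does not signal a failed install, a build step can also check afterwards that the package directory actually exists in the library. This is a minimal sketch, assuming an installed package always has a `DESCRIPTION` file in its library directory. The function name `require_r_packages` is hypothetical. The check only confirms that the files are there, not that the package loads.

```shell
#!/bin/sh
# Hypothetical helper (not part of littler or rocker): return non-zero unless
# every named package has a DESCRIPTION file in the given R library.
#   $1 = R library directory
#   remaining arguments = package names
require_r_packages() {
    lib_dir="$1"
    shift
    status=0
    for pkg in "$@"; do
        if [ ! -f "${lib_dir}/${pkg}/DESCRIPTION" ]; then
            # Report every missing package before failing, not just the first.
            echo "missing R package: ${pkg}" >&2
            status=1
        fi
    done
    return "$status"
}
```

Saved as a script and called at the end of a `RUN` step, for example as `require_r_packages /usr/local/lib/R/site-library sangeranalyseR`, this would stop the build when a Bioconductor install failed silently.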
Thanks @eddelbuettel and @cboettig for all of the help! ...and thanks @cboettig for test-building a dockerfile that works 🚀
@cboettig , is your use of `$R_HOME/site-library/littler/examples/installBioc.r` the current best-practice that I should include in my PR to update the docs?
> though it does take 330 seconds to build
FYI: it took 1384 sec to build the `RUN install2.r ...` layer on my M1 macbook, and the image is 1845.62 MB
> r2u does this magically
Not really. `r2u` relies on binaries and has them for all of CRAN. I did build sangeranalyseR from source because that one is not among the ~ 240 BioC binaries in r2u.
There are also some BioC folks already using / poking at r2u so you could ask on the BioC slack or lists too for best practices.
As for `installBioc.r`, I have several dozen scripts in that littler directory including half a dozen installation helpers. We don't promote all into the path but maybe should. Easy enough for you to add too.
So for completeness, now after dinner, with the following Dockerfile
FROM eddelbuettel/r2u:jammy
## depends per https://www.bioconductor.org/packages/release/bioc/html/sangeranalyseR.html
RUN install.r argparse stringr ape Biostrings DECIPHER reshape2 phangorn gridExtra \
shiny shinydashboard shinyjs data.table plotly DT zeallot excelR shinycssloaders ggdendro \
shinyWidgets openxlsx rmarkdown knitr seqinr BiocStyle logger BiocManager
## now our main target
RUN Rscript -e 'bspm::disable(); BiocManager::install(c("sangerseqR", "sangeranalyseR"))'
we install in 64 seconds.
> FYI: it took 1384 sec to build the `RUN install2.r ...` layer on my M1 macbook, and the image is 1845.62 MB
Arm64 platform does not support binary installation of CRAN packages, so installation takes longer.
https://rocker-project.org/images/versioned/r-ver.html#overview
> Arm64 platform does not support binary installation of CRAN packages, so installation takes longer.
I ran the build for `linux/amd64`:
docker buildx build --push --platform linux/amd64 -t ${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${IMAGE_NAME}:${IMAGE_VERSION} ${IMAGE_NAME}