r2dt-bio/R2DT

Bundle Rfam seeds into cms tarball?

afg1 opened this issue · 2 comments

afg1 commented

As part of the RNAcentral pipeline, we run R2DT in a Nextflow process that uses a Singularity container converted from the Docker image built in this repo.

To set up a run, we currently download and unpack cms.tar.gz, then bind-mount the resulting folder to the correct place within the R2DT container. This worked with previous versions of R2DT (<1.3).
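For illustration, the setup step described above could look roughly like this in Python. This is only a sketch: the helper name `prepare_cms` is hypothetical, and it assumes that the archive unpacks to a top-level `cms` folder and that /rna/r2dt/data/cms is the target path inside the container.

```python
import tarfile
from pathlib import Path


def prepare_cms(tarball: Path, workdir: Path,
                container_path: str = "/rna/r2dt/data/cms") -> str:
    """Unpack the precomputed library and return a 'host:container' bind spec.

    Assumptions (not confirmed by the R2DT docs): the archive contains a
    top-level 'cms' directory, and container_path is where R2DT expects it.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as archive:
        archive.extractall(workdir)

    host_dir = (workdir / "cms").resolve()
    if not host_dir.is_dir():
        raise FileNotFoundError(f"expected {host_dir} after unpacking {tarball}")

    # Singularity bind mounts are written as <host path>:<container path>.
    return f"{host_dir}:{container_path}"
```

The returned string can be passed to `singularity exec --bind`. Anything R2DT tries to write outside the bound folder still lands on the read-only container filesystem.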

Now that R2DT can transfer pseudoknots, it needs access to the Rfam seed files. Because these are not included in the precomputed library, R2DT attempts to download them. However, the directory /rna/r2dt/data/rfam/ is not writable, since it is inside the Singularity container and not bind-mounted.

Would it be possible to include the Rfam seeds in the precomputed library?

@afg1 Thanks for raising the issue, Andrew!

Bundling Rfam seeds with the downloadable files is a good solution as it will increase the size only marginally. However, this is a potentially breaking change because some people might have already downloaded the precomputed library and would not know that they need to download a new file. 🤔

As a workaround, R2DT could first check /rna/r2dt/data/cms for the Rfam seed files. If they are not there, R2DT would try to download them as it does now. With a new precomputed library, the files would already be present and no download would be needed.
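A minimal sketch of that lookup order, not the actual R2DT code: `resolve_seed`, its parameters, and the injected `download` callable are all hypothetical names, and the real seed file layout may differ.

```python
from pathlib import Path
from typing import Callable


def resolve_seed(name: str, bundled_dir: Path, cache_dir: Path,
                 download: Callable[[str, Path], None]) -> Path:
    """Return the path to a seed file, preferring the bundled copy.

    bundled_dir stands in for the precomputed library (e.g. /rna/r2dt/data/cms);
    cache_dir stands in for the location R2DT downloads to today.
    Both names and the flat file layout are assumptions.
    """
    bundled = bundled_dir / name
    if bundled.is_file():
        # New precomputed library: no network access or writable dir needed.
        return bundled

    cached = cache_dir / name
    if not cached.is_file():
        # Old library: fall back to the current download behaviour.
        cache_dir.mkdir(parents=True, exist_ok=True)
        download(name, cached)
    return cached
```

Because the bundled copy is checked first, users with an old precomputed library keep the current behaviour, while a read-only container with the new library never reaches the download branch.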

I can look into it in the next couple of days and make a new release.

@afg1 also reported another issue related to Rfam and network requests:

requests.exceptions.InvalidSchema: No connection adapters were found for 'ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/database_files/family.txt.gz'
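For context, `requests` only ships connection adapters for http:// and https://, so an ftp:// URL fails before any network traffic happens. If a download had to stay, one option would be the standard library's `urllib.request`, which does handle ftp://. The `fetch` helper below is a hypothetical sketch, not R2DT code.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit

# Schemes that urllib.request.urlopen handles out of the box.
SUPPORTED_SCHEMES = {"ftp", "http", "https", "file"}


def fetch(url: str, dest: Path) -> Path:
    """Stream url to dest using urllib instead of requests (illustrative only)."""
    scheme = urlsplit(url).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {scheme!r}")

    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

This would still need a writable destination and network access, so it does not help inside a read-only container.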

All Rfam network requests should be eliminated and Rfam files should be bundled into the precomputed library.