Genome and other 'external' files
Closed this issue · 4 comments
As (partly) discussed in #30 (comment), we currently have two ways of handling genome files in our modules / workflows: using params or via input/output channels. I don't think we have 'clashing' tools at the moment (DNA tools use params and RNA tools use channels). However, to make sure we don't get clashes in the future, I do think we need to choose which method we use in our modules.
To be honest, I do not really see the benefit of using channels for these kinds of files in our (DX) workflows. These are static files which in our case should almost never change, so adding them as a channel would only make our workflows more complex. I also think that debugging becomes harder when using channels for these files, as the .command.run files won't contain the full path to these files anymore. This is also the case for bam / vcf headers etc.
I do think that using channels for these files makes the workflow more portable and/or makes it easier to change genome builds. That is, I think, the main reason why the 'nf-core' workflows use this pattern. I am not sure whether this is worth the added complexity for our workflows. We are thinking more in the direction of a 'deploy workflow' script which would install all required files and repositories.
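To make the two patterns concrete, here is a minimal sketch of both styles (process and parameter names are hypothetical, assuming Nextflow DSL2 syntax):

```nextflow
// Pattern 1: 'static' file via params -- the full genome path ends up
// verbatim in the generated .command.sh / .command.run files.
process BWA_MEM_PARAMS {
    input:
    tuple val(sample_id), path(reads)

    output:
    path "${sample_id}.sam"

    script:
    """
    bwa mem ${params.genome_fasta} ${reads} > ${sample_id}.sam
    """
}

// Pattern 2: the same file via an input channel -- Nextflow stages the
// file into the task work directory, so the script only sees the staged name.
process BWA_MEM_CHANNEL {
    input:
    tuple val(sample_id), path(reads)
    path genome_fasta

    output:
    path "${sample_id}.sam"

    script:
    """
    bwa mem ${genome_fasta} ${reads} > ${sample_id}.sam
    """
}
```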
I tend to agree on the complexity part here. I like the idea of a deployment workflow that checks and builds all the necessary files based on the options given to a specific pipeline.
Example:
I want to do mapping + variant calling and annotation; the deployment workflow first checks if a genome was properly built and all annotation files are present. If not, it will build them specifically.
If the output location of all files generated this way is always the same (or at least follows the same pattern), you don't have to use channels for resource files. Another side note: it's important to use the storeDir directive in your deploy processes so the deployment isn't redone when a file is already present. Have a look at the BWA/0.7.17/Index.nf script for an example.
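A hedged sketch of what such a deploy process with storeDir could look like (directory layout and names are assumptions, not the actual BWA/0.7.17/Index.nf):

```nextflow
process BWA_INDEX {
    // If the declared outputs already exist in this directory, Nextflow
    // skips the task and reuses the stored files instead of redoing work.
    storeDir "${params.resource_dir}/BWA/0.7.17"

    input:
    path genome_fasta

    output:
    path "${genome_fasta}.{amb,ann,bwt,pac,sa}"

    script:
    """
    bwa index ${genome_fasta}
    """
}
```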
I do think the usage of Channels for resource files makes sense in certain situations and might not be required/necessary in others.
Case 1) Primary resource files
As stated by @rernst, if you have a very static workflow that deploys all required files in advance and relies on primary resource files (Fasta, GTFs etc.), then I do not see the need to refactor the process to use file channels.
Case 2) Secondary resource files
The situation becomes different if you want to build derived resources (STAR, Salmon, BWA indices etc.) at run-time. In this case, the resource files become no different from regular process inputs that connect two processes via async communication, and they should be handled as such. The alternative would be to include a certain process B after channel A has generated the required resources and set the file path within params at include level. That seems a rather counter-intuitive and cumbersome solution. I am not even sure whether it would work correctly, since the whole point of channels is to connect processes, handle correct parallelization, etc.
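For Case 2, the channel wiring could look roughly like this (process and parameter names are hypothetical); the downstream process waits on the derived resource automatically:

```nextflow
workflow {
    genome_ch = Channel.fromPath(params.genome_fasta)
    reads_ch  = Channel.fromFilePairs(params.reads)

    // Build the derived resource at run-time...
    star_index = STAR_INDEX(genome_ch)

    // ...and let the channel handle the ordering: alignment only starts
    // once the index has been produced.
    STAR_ALIGN(reads_ch, star_index)
}
```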
I do not use a fixed storeDir yet for large genome files. Currently, they are stored in the output folder and the user has to move the resource files to the desired location and create a config file for the next run. Indeed a good idea.
I do think that providing file inputs as channels improves readability, portability and overall cleanness of the implementation. We might think of applying this design pattern only for Case 2 situations. I do like the fact of having the full path of the original genome files within the .command.sh script.
I am in doubt, some thoughts:
- BWA is actually different from STAR and Salmon, because the input is the genome.fasta, not the index built by bwa index. BWA just expects the index to be in the same folder as the genome.fasta; the same applies to GATK.
- storeDir looks good. I do think we should configure this in the workflow.config, not in the .nf files, just like publishDir. This allows the workflow developer to decide what he/she wants to do with these files.
- I don't really see the difference between the primary / secondary files @tilschaef is referring to. Our genome fasta files are also created by a 'process'. Just like an index, you have to prepare these in a certain way, which I don't think should be part of a sample-processing workflow. I would rather have my workflow crash when certain index files are missing than have it create them.
- I do not agree that channels improve readability and cleanness of the code; I actually see a lot of channel-handling code in the RNAseq workflow (mixed with params stuff) which I do not need at all in my current workflows. I do agree that it (in theory) improves portability, but I am not sure whether this is worth all the extra code. Besides that, you would still need a reference genome, so I am not sure why you would not supply the index files as well instead of having to regenerate them.
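Setting storeDir from the config rather than the .nf file could look like this (selector name and path are assumptions), analogous to how publishDir is commonly configured per process:

```nextflow
// workflow.config (hypothetical)
process {
    withName: 'BWA_INDEX' {
        storeDir = "${params.resource_dir}/BWA"
    }
}
```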
Maybe it is an idea to do the following:
- For consistency, always use params to define 'static' files: genome.fasta, genome.index, gtf, .bed, .vcf (annotations).
- Then, if you would like to be able to build these files in your workflow, either:
  - Check if the file exists -> if not: create it and save it with storeDir. The only problem is that I don't know how you can make the other processes wait for this.
  - Or create a small workflow that creates all these files; you can probably use the same config.
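The 'check if the file exists' idea could be sketched with a small helper at the top of the workflow script (helper name is hypothetical); this fails fast instead of silently building missing resources:

```nextflow
// Resolve a params path and crash early if the resource is missing.
def requireFile(p) {
    def f = file(p)
    if( !f.exists() )
        exit 1, "Required resource file not found: ${p}"
    return f
}

genome_fasta = requireFile(params.genome_fasta)
genome_index = requireFile(params.genome_index)
```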
Summary of zoom call:
- Use params for resource files.
- Create a workflow to prepare/build all required genome files.
I will update the readme.