elixir-no-nels/rbFlow-Germline

Spark tools are officially recommended to not be used for production


Technically, we cannot guarantee to the end user that the output from the Spark-enabled tools is of high quality. The log file from HaplotypeCallerSpark says this:

The output of this tool DOES NOT match the output of HaplotypeCaller.
It is under development and should not be used for production work.
For evaluation only.
Use the non-spark HaplotypeCaller if you care about the results.

BaseRecalibratorSpark says this:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: BaseRecalibratorSpark is a BETA tool and is not yet ready for use in production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

If we want to ship with the Spark-enabled versions while they're still in beta, I think we should inform the end user that the results aren't guaranteed to match the quality of the non-Spark HaplotypeCaller.

All Spark tools are still in beta, so in that case we can't use them.

The issue with using only non-Spark tools is the running time: if we run them as they are, with no parallelization, the workflow will unfortunately be prohibitively slow. To my understanding there is no way to implement scatter-gather parallelization in rbFlow, so that leaves us with two choices:

  1. Use rbFlow with non-spark tools and accept the increased execution time.
  2. Abandon rbFlow completely and rebuild the workflow in a new language. If Snakemake, WDL, Nextflow or bash is acceptable, I have options for each, all with their own pros and cons. But perhaps CWL is the only acceptable option?

I have asked for a release timeline here: https://gatkforums.broadinstitute.org/gatk/discussion/11245/spark?

If the ETA is very far into the future it may be worth exploring alternatives so that we at least use the time for something productive.

It's possible to fork processes in ruby: http://rubyforadmins.com/background-jobs
So I think we can loop over the sub-interval lists, spawn one process per list, wait for all of them to finish, and then proceed with the next tool. We would need to implement this workflow in rbFlow: https://raw.githubusercontent.com/oskarvid/snakemake_germline/master/dag.png
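A minimal sketch of that loop, assuming each tool invocation can be expressed as an argument array. The method name `run_scattered` is hypothetical and not part of rbFlow; `Process.spawn` and `Process.wait2` are Ruby core methods.

```ruby
# Hypothetical helper, not part of rbFlow: run one command per sub-interval
# list in the background, then wait for all of them before continuing.
# `commands` is an array of argument arrays, e.g.
#   [["gatk", "BaseRecalibrator", "-L", "intervals_0.list", ...], ...]
def run_scattered(commands)
  # Start every command without waiting; Process.spawn returns the child's pid.
  pids = commands.map { |argv| Process.spawn(*argv) }

  # Block until each child has exited and keep its exit status.
  statuses = pids.map do |pid|
    _, status = Process.wait2(pid)
    status
  end

  # Refuse to move on to the next tool if any scatter job failed.
  failed = statuses.each_index.reject { |i| statuses[i].success? }
  raise "scatter jobs failed at indexes #{failed.inspect}" unless failed.empty?

  statuses
end
```

This starts all sub-interval jobs at once, so the number of interval lists also sets the number of concurrent processes; a real implementation would probably want to cap that to the available cores.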

In particular we need to do the following changes:

  1. Add a FastqToSam step to create an unmapped bam file
  2. Combine the mapped and unmapped bam files with MergeBamAlignment
  3. Combine the MergeBamAlignment outputs with MarkDuplicates
  4. Scatter BaseRecalibration by running a for loop that forks each process to the background
  5. Gather the outputs with GatherBQSR
  6. Repeat the same principle from steps 4 and 5 for ApplyBQSR and HaplotypeCaller. We can do this for GenotypeGVCFs too if we want to; I think it saves about 30 minutes.
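Steps 4 and 5 could be sketched as two small command builders. This is only an illustration: the method names, file names and output layout are hypothetical, and the flags assume the GATK4 command line, where the gather tool referred to above as GatherBQSR is named `GatherBQSRReports`.

```ruby
# Hypothetical helpers, not part of rbFlow. They only build argument arrays;
# running them (e.g. with Process.spawn) is left to the caller.

# Scatter: one BaseRecalibrator command per sub-interval list.
# Returns pairs of [command, report_path] so the gather step knows its inputs.
def scatter_bqsr_commands(bam:, reference:, known_sites:, interval_lists:, out_dir:)
  interval_lists.each_with_index.map do |intervals, i|
    report = File.join(out_dir, "recal_#{i}.table") # assumed naming scheme
    cmd = ["gatk", "BaseRecalibrator",
           "-I", bam,
           "-R", reference,
           "--known-sites", known_sites,
           "-L", intervals,
           "-O", report]
    [cmd, report]
  end
end

# Gather: merge the per-interval recalibration reports into a single table.
def gather_bqsr_command(reports:, output:)
  inputs = reports.flat_map { |report| ["-I", report] }
  ["gatk", "GatherBQSRReports", *inputs, "-O", output]
end
```

The same scatter-then-gather shape would apply to ApplyBQSR and HaplotypeCaller, with a different gather tool for each output type.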

I understand that this is a lot of work, but it's probably possible, so I want to at least present the possibility so that we know it's a potential option.

I got a reply to my question about an ETA for a stable Spark release:

Hi @oskarv,

This is hard to say. Others have asked similarly here and here. My guess is it will be a while. In the meantime, we hope you do try out the BETA versions of these Spark implementations and let us know what you think. It's feedback from researchers like yourself that really helps drive the development of our tools forward.
https://gatkforums.broadinstitute.org/gatk/discussion/comment/54849/#Comment_54849

So what do we do now?

Since we've decided not to use the Spark tools for this release, I'm closing this issue.