Spark tools are officially recommended to not be used for production

Question

Spark tools are officially recommended to not be used for production

Closed this issue 6 years ago · 7 comments

Technically, we cannot guarantee to the end user that the output from the Spark-enabled tools is of high quality. The log file from HaplotypeCallerSpark says this:

The output of this tool DOES NOT match the output of HaplotypeCaller.
It is under development and should not be used for production work.
For evaluation only.
Use the non-spark HaplotypeCaller if you care about the results.

BaseRecalibratorSpark says this:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: BaseRecalibratorSpark is a BETA tool and is not yet ready for use in production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

If we want to ship with the Spark-enabled versions while they are still in beta, I think we should inform the end user that the results are not guaranteed to have the same quality as those from the non-Spark HaplotypeCaller.

Answer 1 · 2018-12-11T14:45:45.000Z

This is critical information! If this applies not only to the initial SparkMappongStepModule but also to downstream modules like HTC, we need to use only non-Spark version 4 tools in the release 1 milestone.


Answer 2 · 2018-12-12T09:17:08.000Z

All Spark tools are still in beta, so in that case we can't use them.

The issue with using only non-Spark tools is the running time: if we just run them as they are, with no parallelization implemented, it will unfortunately be prohibitively slow. To my understanding there is no way to implement scatter-gather parallelization in rbFlow, so that leaves us with two choices:

1. Use rbFlow with non-Spark tools and accept the increased execution time.
2. Abandon rbFlow completely and rebuild the workflow in a new language. If Snakemake, WDL, Nextflow or bash is acceptable, I have options, each with pros and cons. But perhaps CWL is the only acceptable option?

Answer 3 · 2018-12-12T14:57:19.000Z

Is there any timeline for when GATK 4.0 Spark support is expected to be ready for production at all?


Answer 4 · 2018-12-13T09:07:19.000Z

I have asked for a release timeline here: https://gatkforums.broadinstitute.org/gatk/discussion/11245/spark?

If the ETA is very far in the future, it may be worth exploring alternatives so that we at least use the time for something productive.

Answer 5 · 2018-12-13T10:01:22.000Z

It's possible to fork processes in Ruby: http://rubyforadmins.com/background-jobs
So I think we can write a for loop that spawns one process per sub-interval list, waits for them all to finish, and then proceeds with the next tool. We would need to implement this workflow in rbFlow: https://raw.githubusercontent.com/oskarvid/snakemake_germline/master/dag.png
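The fork-and-wait idea could look roughly like the sketch below. It is only an illustration, not rbFlow code: the `scatter` helper and the block that builds each command line are hypothetical names, and it assumes a platform where Ruby's `fork` is available (Linux or macOS).

```ruby
# Hypothetical helper: run one child process per sub-interval list and
# wait for all of them before returning.
# The caller's block receives one interval list and must return the
# command to run as an array (program name followed by its arguments).
def scatter(interval_lists)
  # Scatter: fork one child per interval list; each child replaces
  # itself with the external command via exec.
  children = interval_lists.map do |intervals|
    argv = yield(intervals)
    pid = fork { exec(*argv) }
    [pid, intervals]
  end

  # Gather: block until every child has exited and remember failures.
  failed = []
  children.each do |pid, intervals|
    _, status = Process.wait2(pid)
    failed << intervals unless status.success?
  end

  raise "scatter failed for: #{failed.join(', ')}" unless failed.empty?
  interval_lists.size
end
```

The block keeps the helper independent of any particular tool, so the same loop could in principle be reused for each scattered step. All children start at once here, so a real implementation would probably also need to cap the number of concurrent processes.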

In particular, we need to make the following changes:

1. Add a FastqToSam step to create an unmapped BAM file.
2. Combine the mapped and unmapped BAM files with MergeBamAlignment.
3. Combine the MergeBamAlignment outputs with MarkDuplicates.
4. Scatter BaseRecalibration by running a for loop that forks each process to the background.
5. Gather the outputs with GatherBQSR.
6. Apply the same scatter-gather principle from steps 4 and 5 to ApplyBQSR and HaplotypeCaller. We can do this for GenotypeGVCFs too if we want to; I think it saves about 30 minutes.
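As a rough sketch of the scatter and gather steps for base recalibration, the command lines could be built as plain argument arrays like this. The helper names and all file names are hypothetical. The flags are the ones I believe GATK4 uses, and I am assuming the gather tool is called GatherBQSRReports in GATK4, so both should be checked against the installed version.

```ruby
# Hypothetical helper: command for one scattered BaseRecalibrator run,
# restricted to a single sub-interval list via -L.
def base_recalibrator_cmd(bam:, reference:, known_sites:, intervals:, output:)
  ["gatk", "BaseRecalibrator",
   "-I", bam,
   "-R", reference,
   "--known-sites", known_sites,
   "-L", intervals,
   "-O", output]
end

# Hypothetical helper: command that merges the per-interval
# recalibration tables into one report (assumed tool name:
# GatherBQSRReports), with one -I flag per input table.
def gather_bqsr_cmd(tables:, output:)
  ["gatk", "GatherBQSRReports"] +
    tables.flat_map { |table| ["-I", table] } +
    ["-O", output]
end
```

Each array can be handed to `exec` or `Process.spawn` directly, which avoids shell quoting problems with file paths.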

I understand that this is a lot of work, but it's probably possible, so I want to at least present the possibility so that we know it's a potential option.

Answer 6 · 2019-01-07T07:55:40.000Z

I got a reply to my question about an ETA for a stable Spark release:

Hi @oskarv,

This is hard to say. Others have asked similarly here and here. My guess is it will be a while. In the meantime, we hope you do try out the BETA versions of these Spark implementations and let us know what you think. It's feedback from researchers like yourself that really helps drive the development of our tools forward.
https://gatkforums.broadinstitute.org/gatk/discussion/comment/54849/#Comment_54849

So what do we do now?

Answer 7 · 2019-02-21T09:44:27.000Z

Since we've decided not to use the Spark tools for this release, I'm closing this issue.