Issue while saving variants as vcf.
Arun-George-Zachariah opened this issue · 8 comments
Hi,
With the Cannoli CLI, while trying to save the variants called by `freebayes` as `vcf`, I get the error:
java.lang.IllegalStateException: Key CIGAR found in VariantContext field INFO at chr3:75621022 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:202)
at htsjdk.variant.vcf.VCFEncoder.write(VCFEncoder.java:141)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:248)
at org.seqdoop.hadoop_bam.VCFRecordWriter.writeRecord(VCFRecordWriter.java:148)
at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:60)
at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:38)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:358)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:132)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:129)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How do I go about this?
I faced the same problem. Based on my investigation, the root cause of this issue is that the default `headerLines` of the `variantContexts` are replaced with the `accumulator`'s values, which are in fact empty. The dataset then ends up without any stored `headerLines`. When saving as `vcf`, the output headers are expected to match those stored in this object; otherwise the error above is raised.
```scala
// pipe alignments through freebayes; header lines seen while parsing the
// freebayes output are collected into `accumulator` on the executors
val variantContexts = alignments.pipe[VariantContext, VariantContextProduct, VariantContextDataset, BAMInFormatter](
  cmd = builder.build(),
  files = builder.getFiles()
)

// pipe is lazy: no action has run yet, so accumulator.value is still empty here
val headerLines = accumulator.value.distinct
variantContexts.replaceHeaderLines(headerLines)
```
The thing is that the `accumulator` has not been updated yet, because no action has been performed on the given `variantContexts`. It seems this should be fixed. @heuermh, could you please have a look at this?
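For illustration, a minimal self-contained sketch (plain Spark, with hypothetical names, not Cannoli's actual code) of the laziness at play: accumulator updates registered inside a transformation only happen once an action runs.

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorLaziness {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lazy-acc").getOrCreate()
    val sc = spark.sparkContext

    // collect "header lines" seen on executors, analogous to Cannoli's header accumulator
    val acc = sc.collectionAccumulator[String]("headerLines")
    val lines = sc.parallelize(Seq("##INFO=<ID=CIGAR,...>", "##INFO=<ID=DP,...>"))

    // map is a lazy transformation: the accumulator update is only scheduled here
    val tagged = lines.map { line => acc.add(line); line }

    println(acc.value.size()) // 0: no action has run yet, mirroring the empty headerLines
    tagged.count()            // first action: tasks execute and the accumulator fills
    println(acc.value.size()) // 2

    spark.stop()
  }
}
```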
Hello @dmaziec1, your analysis of the issue is correct. I don't have a good idea for a fix -- accumulators are problematic for several reasons.
As a workaround, it might be possible to read in an empty freebayes VCF file and use its `headerLines` to replace the empty ones set from the accumulator before writing out the VCF.
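An untested sketch of that workaround, continuing from the snippet above; it assumes ADAM's `loadVcf`/`headerLines`/`saveAsVcf` APIs and an ADAM-wrapped SparkContext `sc`, and the file paths are placeholders:

```scala
import htsjdk.samtools.ValidationStringency
import org.bdgenomics.adam.rdd.ADAMContext._

// header-only VCF produced by freebayes (e.g. a run with no reads)
val headerOnly = sc.loadVcf("freebayes.header-only.vcf")

// swap the empty accumulator-derived header lines for the real freebayes ones
val fixed = variantContexts.replaceHeaderLines(headerOnly.headerLines)

fixed.saveAsVcf("variants.vcf",
  asSingleFile = true,
  deferMerging = false,
  disableFastConcat = false,
  ValidationStringency.LENIENT)
```

Since `loadVcf` takes the header lines straight from the file header rather than from piped records, this should sidestep the accumulator entirely.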
Yes, it works fine for me.
Thanks @dmaziec1 for bringing this issue back to life, and thanks @heuermh for the changes. For some reason, I still get the same error.
I was able to work around this issue with a minor change in ADAM, as shown here. While this worked, the output `vcf` was not as rich as the one I get from vanilla FreeBayes.
@dmaziec1, could you share some more info on the genome sequence being used?
While testing #273 I ran some performance tests, albeit only in local mode on my laptop:
https://gist.github.com/heuermh/ba33a061957e99b6719ed9712a7ace67
This line from the pull request appends the freebayes-specific header lines to the default header lines; the combined set includes the CIGAR VCF INFO header line that is apparently missing in the stack trace above.
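For reference, a rough sketch of what such an append can look like using htsjdk header-line types; `DefaultHeaderLines` is assumed to be ADAM's default header-line set, the list of freebayes lines is abbreviated, and the CIGAR description is shortened:

```scala
import htsjdk.variant.vcf.{ VCFHeaderLine, VCFHeaderLineCount, VCFHeaderLineType, VCFInfoHeaderLine }
import org.bdgenomics.adam.converters.DefaultHeaderLines

// the freebayes-specific INFO line whose absence triggered the exception above
val cigar: VCFHeaderLine = new VCFInfoHeaderLine(
  "CIGAR",
  VCFHeaderLineCount.A,
  VCFHeaderLineType.String,
  "The extended CIGAR representation of each alternate allele")

// append freebayes-specific lines to the defaults, dropping duplicates
val freebayesHeaderLines: Seq[VCFHeaderLine] = Seq(cigar) // the real list is longer
val combined = (freebayesHeaderLines ++ DefaultHeaderLines.allHeaderLines).distinct
```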
Many thanks @heuermh. It was a mistake on my side: I was using the `master` branch instead of `freebayes-headers`.
The changes work perfectly, on a 16-node cluster as well. Also, thanks for pointing out the performance tests; they further justify and strengthen the case for ADAM and Cannoli.
Thank you, @Arun-George-Zachariah and @dmaziec1!