Issue while saving variants as vcf.
Arun-George-Zachariah opened this issue · 8 comments
Hi,
With the Cannoli CLI, while trying to save the variants called by `freebayes` as `vcf`, I get the error:
java.lang.IllegalStateException: Key CIGAR found in VariantContext field INFO at chr3:75621022 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
at htsjdk.variant.vcf.VCFEncoder.fieldIsMissingFromHeaderError(VCFEncoder.java:202)
at htsjdk.variant.vcf.VCFEncoder.write(VCFEncoder.java:141)
at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:248)
at org.seqdoop.hadoop_bam.VCFRecordWriter.writeRecord(VCFRecordWriter.java:148)
at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:60)
at org.seqdoop.hadoop_bam.KeyIgnoringVCFRecordWriter.write(KeyIgnoringVCFRecordWriter.java:38)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:358)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$executeTask$1(SparkHadoopWriter.scala:132)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:129)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How do I go about this?
I faced the same problem. Based on my investigation, the root cause of this issue is that the default `headerLines` of the `variantContexts` are replaced with the `accumulator`'s values, which are in fact empty. The dataset then ends up without any stored `headerLines`. When saving as `vcf`, the output headers are expected to match those stored in this object; otherwise the error above is raised.
```scala
// pipe alignments through freebayes; header lines seen while parsing the
// freebayes output are collected into `accumulator` on the executors
val variantContexts = alignments.pipe[VariantContext, VariantContextProduct, VariantContextDataset, BAMInFormatter](
  cmd = builder.build(),
  files = builder.getFiles()
)

// pipe is lazy: no action has run yet, so accumulator.value is still empty here
val headerLines = accumulator.value.distinct
variantContexts.replaceHeaderLines(headerLines)
```
The thing is that the `accumulator` has not been updated yet, because no action has been performed on the given `variantContexts`. It seems this should be fixed. @heuermh, could you please have a look at this?
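For illustration, a minimal self-contained sketch (plain Spark, with hypothetical names, not Cannoli's actual code) of the laziness at play: accumulator updates registered inside a transformation only happen once an action runs.

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorLaziness {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lazy-acc").getOrCreate()
    val sc = spark.sparkContext

    // collect "header lines" seen on executors, analogous to Cannoli's header accumulator
    val acc = sc.collectionAccumulator[String]("headerLines")
    val lines = sc.parallelize(Seq("##INFO=<ID=CIGAR,...>", "##INFO=<ID=DP,...>"))

    // map is a lazy transformation: the accumulator update is only scheduled here
    val tagged = lines.map { line => acc.add(line); line }

    println(acc.value.size()) // 0: no action has run yet, mirroring the empty headerLines
    tagged.count()            // first action: tasks execute and the accumulator fills
    println(acc.value.size()) // 2

    spark.stop()
  }
}
```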
Hello @dmaziec1, your analysis of the issue is correct. I don't have a good idea for a fix -- accumulators are problematic for several reasons.
As a workaround, it might be possible to read in an empty freebayes VCF file and use its `headerLines` to replace the empty ones set from the accumulator before writing out the VCF.
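An untested sketch of that workaround, continuing from the snippet above; it assumes ADAM's `loadVcf`/`headerLines`/`saveAsVcf` APIs and an ADAM-wrapped SparkContext `sc`, and the file paths are placeholders:

```scala
import htsjdk.samtools.ValidationStringency
import org.bdgenomics.adam.rdd.ADAMContext._

// header-only VCF produced by freebayes (e.g. a run with no reads)
val headerOnly = sc.loadVcf("freebayes.header-only.vcf")

// swap the empty accumulator-derived header lines for the real freebayes ones
val fixed = variantContexts.replaceHeaderLines(headerOnly.headerLines)

fixed.saveAsVcf("variants.vcf",
  asSingleFile = true,
  deferMerging = false,
  disableFastConcat = false,
  ValidationStringency.LENIENT)
```

Since `loadVcf` takes the header lines straight from the file header rather than from piped records, this should sidestep the accumulator entirely.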
Yes, it works fine for me.
Thanks @dmaziec1 for bringing this issue back to life, and thanks @heuermh for the changes. For some reason, I still get the same error.
I was able to work around this issue with a minor change in ADAM, as shown here. While this worked, the output `vcf` was not as rich as the one I get from vanilla FreeBayes.
@dmaziec1, could you share some more info on the genome sequence being used?
While testing #273 I ran some performance tests, albeit only in local mode on my laptop:
https://gist.github.com/heuermh/ba33a061957e99b6719ed9712a7ace67
This line from the pull request appends the freebayes-specific header lines to the default header lines; the combined set includes the CIGAR VCF INFO header line that is apparently missing in the stack trace above.
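For reference, a rough sketch of what such an append can look like using htsjdk header-line types; `DefaultHeaderLines` is assumed to be ADAM's default header-line set, the list of freebayes lines is abbreviated, and the CIGAR description is shortened:

```scala
import htsjdk.variant.vcf.{ VCFHeaderLine, VCFHeaderLineCount, VCFHeaderLineType, VCFInfoHeaderLine }
import org.bdgenomics.adam.converters.DefaultHeaderLines

// the freebayes-specific INFO line whose absence triggered the exception above
val cigar: VCFHeaderLine = new VCFInfoHeaderLine(
  "CIGAR",
  VCFHeaderLineCount.A,
  VCFHeaderLineType.String,
  "The extended CIGAR representation of each alternate allele")

// append freebayes-specific lines to the defaults, dropping duplicates
val freebayesHeaderLines: Seq[VCFHeaderLine] = Seq(cigar) // the real list is longer
val combined = (freebayesHeaderLines ++ DefaultHeaderLines.allHeaderLines).distinct
```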
Many thanks @heuermh. It was a mistake on my side: I was using the `master` branch instead of `freebayes-headers`.
The changes work perfectly, on a 16-node cluster as well. Also, thanks for pointing out the performance tests; they further justify and strengthen the case for ADAM and Cannoli.
Thank you, @Arun-George-Zachariah and @dmaziec1!