Edge case in LiftoverVcf causes cryptic string exception
rickymagner opened this issue · 8 comments
Bug Report
Affected tool(s)
LiftoverVcf
Affected version(s)
- Latest public release version [3.1.1] (GATK 4.5.0.0)
Description
This was discovered in trying to debug #1933. There is an edge case in lifting over deletions when they contract a tandem repeat close to the edge of the resulting liftover contig. Consider the entry in a real chain file (details in the steps to reproduce below on where files live):
chain 4900 chr17 83257441 + 22527230 22533041 unplaced_85 5811 - 0 5811 2535
Next consider the variant:
chr17 22533036 . GTA G . . .
The 7 bases on chr17 from 22533036 are: GTATATG, and the first 6 bases on contig unplaced_85 are: ATATACAT. The chain file says that the GTATAT on chr17 corresponds to ATATAC on unplaced_85. The deletion as written in the input VCF corresponds to ATA--C in unplaced_85. When left aligned (as happens in the algorithm), this looks like A--TAC, so should be recorded as:
unplaced_85 1 ATA A . . .
But instead, LiftoverVcf throws an error (see stacktrace below) about a string index out of bounds.
I suspect there is a bug in our left aligning step, but can't see it yet from the code. There is a start - 2
in LiftoverUtils.leftAlignVariant
which looks suspicious, but I haven't pinned that down as the direct cause yet.
Here are some experiments playing with the DEL length:
REF | ALT | RESULT |
---|---|---|
GT | G | Accept: unplaced_85 at POS 4 |
GTA | G | Error |
GTAT | G | Accept: unplaced_85 at POS 2 |
GTATA | G | Error |
GTATAT | G | Error |
GTATATG | G | Reject: NoTarget |
The fact that the smaller "odd" deletions work but the full contractions don't highly suggests something strange is happening during the left alignment after liftover. But the code is a bit complex so it'll take some digging.
Steps to reproduce
Grab an hg38 VCF header and record this variant as the sites-only VCF special_variant.vcf.gz
:
chr17 22533036 . GTA G . . .
Then take the JG2.1.0 reference here with chain file here. Save these to ref.fa.gz
and lift.chain
. Get the right ref indices ready. Then run
gatk LiftoverVcf -I special_variant.vcf.gz -O accept.vcf --REJECT reject.vcf -R ref.fa.gz --CHAIN lift.chain
You will see the stacktrace:
java.lang.StringIndexOutOfBoundsException: offset -1, count 3, length 5811
at java.base/java.lang.String.checkBoundsOffCount(String.java:4591)
at java.base/java.lang.String.<init>(String.java:397)
at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:301)
at picard.vcf.LiftoverVcf.tryToAddVariant(LiftoverVcf.java:558)
at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:453)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:280)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
at org.broadinstitute.hellbender.Main.main(Main.java:306)
Expected behavior
If a variant gets left aligned to the start of a contig, we should handle that either with a uniform algorithm that works everywhere or as a special case.
Actual behavior
The user sees an uninformative stack trace.
nice sleuthing work @rickymagner! IIANM, I'm largely responsible for the lift-over of indels, and so if I can help, let me know.
Hi @yfarjoun, if you do have some time to look at this, that would be great. I suspect the part of the code causing this error is located in LiftoverUtils.java
at L389 here:
// 5. if there exists an empty allele then
if (alleleBasesMap.values().stream()
.map(a -> a.length)
.anyMatch(l -> l == 0)) {
// 6. extend alleles 1 nucleotide to the left
for (final Allele allele : alleleBasesMap.keySet()) {
// the first -1 for zero-base (getBases) versus 1-based (variant position)
// another -1 to get the base prior to the location of the start of the allele
final byte extraBase = (theStart > 1) ?
referenceSequence.getBases()[theStart - 2] :
referenceSequence.getBases()[theEnd];
alleleBasesMap.put(allele, extendOneBase(alleleBasesMap.get(allele), extraBase));
}
changesInAlleles = true;
theStart--;
// 7. end if
}
The theStart - 2
looks like it could suspiciously lead to a string index becoming -1 when theStart == 1
, but this could also be a red herring since the conditional which triggers this block might not be satisfied by the given variant. I haven't been able to look too deeply at the code around here to understand what's actually happening.
This is not the moment the stacktrace is thrown of course, but I'm wondering if it's the place where that -1 appears and then propagates further to the moment later in tryToAddVariant
where the string constructor is given index -1 to start which triggers the error. Otherwise I'm not sure how that number would ever become -1 yet.
I'll see what I can do. do you have a test-case that fails? I know you got a vcf from a user, but have you converted it to an additional unit test? I think that would be a good first step, so if you did it, please point me to it, if not, I can try my hand.
Thanks. Yes, I would take the files from the issue above (the JP ref + chain file linked from the public URL; should be small enough to download), and then take the bad deletion variant I wrote on chr17 and add an hg38 header to it. Any header should be fine, but didn't want to post a complete example since there's 3k contigs there. I haven't turned it into a formal Java test yet, but might be able to try that this week (using the example above, or rewriting one using whatever resource files we already have in the test suite).
This is resolved by #1933
@serge2016 I checked it works for the specific edge case variant I found before. Let us know if there are any other variants that give an issue if you try to rerun on it. I'm not sure when the next Picard release will be, but you can test building from master
if you're interested in seeing if the new version of the tool can get through your entire file.