broadinstitute/picard

Edge case in LiftoverVcf causes cryptic string exception

rickymagner opened this issue · 8 comments

Bug Report

Affected tool(s)

LiftoverVcf

Affected version(s)

  • Latest public release version [3.1.1] (GATK 4.5.0.0)

Description

This was discovered in trying to debug #1933. There is an edge case in lifting over deletions when they contract a tandem repeat close to the edge of the resulting liftover contig. Consider the entry in a real chain file (details in the steps to reproduce below on where files live):

chain   4900    chr17   83257441        +       22527230        22533041        unplaced_85     5811    -       0       5811    2535

Next consider the variant:

chr17   22533036        .       GTA     G       .       .       .

The 7 bases on chr17 from 22533036 are: GTATATG, and the first 6 bases on contig unplaced_85 are: ATATACAT. The chain file says that the GTATAT on chr17 corresponds to ATATAC on unplaced_85. The deletion as written in the input VCF corresponds to ATA--C in unplaced_85. When left aligned (as happens in the algorithm), this looks like A--TAC, so should be recorded as:

unplaced_85   1    ATA   A   .    .    .

But instead, LiftoverVcf throws an error (see stacktrace below) about a string index out of bounds.

I suspect there is a bug in our left aligning step, but can't see it yet from the code. There is a start - 2 in LiftoverUtils.leftAlignVariant which looks suspicious, but I haven't pinned that down as the direct cause yet.

Here are some experiments playing with the DEL length:

REF ALT RESULT
GT G Accept: unplaced_85 at POS 4
GTA G Error
GTAT G Accept: unplaced_85 at POS 2
GTATA G Error
GTATAT G Error
GTATATG G Reject: NoTarget

The fact that the smaller "odd" deletions work but the full contractions don't highly suggests something strange is happening during the left alignment after liftover. But the code is a bit complex so it'll take some digging.

Steps to reproduce

Grab an hg38 VCF header and record this variant as the sites-only VCF special_variant.vcf.gz:

chr17   22533036        .       GTA     G       .       .       .

Then take the JG2.1.0 reference here with chain file here. Save these to ref.fa.gz and lift.chain. Get the right ref indices ready. Then run

gatk LiftoverVcf -I special_variant.vcf.gz -O accept.vcf --REJECT reject.vcf -R ref.fa.gz --CHAIN lift.chain

You will see the stacktrace:

java.lang.StringIndexOutOfBoundsException: offset -1, count 3, length 5811
	at java.base/java.lang.String.checkBoundsOffCount(String.java:4591)
	at java.base/java.lang.String.<init>(String.java:397)
	at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:301)
	at picard.vcf.LiftoverVcf.tryToAddVariant(LiftoverVcf.java:558)
	at picard.vcf.LiftoverVcf.doWork(LiftoverVcf.java:453)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:280)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)
	at org.broadinstitute.hellbender.Main.main(Main.java:306)

Expected behavior

If a variant gets left aligned to the start of a contig, we should handle that either with a uniform algorithm that works everywhere or as a special case.

Actual behavior

The user sees an uninformative stack trace.

nice sleuthing work @rickymagner! IIANM, I'm largely responsible for the lift-over of indels, and so if I can help, let me know.

Hi @yfarjoun, if you do have some time to look at this, that would be great. I suspect the part of the code causing this error is located in LiftoverUtils.java at L389 here:

            // 5. if there exists an empty allele then
            if (alleleBasesMap.values().stream()
                    .map(a -> a.length)
                    .anyMatch(l -> l == 0)) {
                // 6. extend alleles 1 nucleotide to the left
                for (final Allele allele : alleleBasesMap.keySet()) {
                    // the first -1 for zero-base (getBases) versus 1-based (variant position)
                    // another   -1 to get the base prior to the location of the start of the allele
                    final byte extraBase = (theStart > 1) ?
                            referenceSequence.getBases()[theStart - 2] :
                            referenceSequence.getBases()[theEnd];

                    alleleBasesMap.put(allele, extendOneBase(alleleBasesMap.get(allele), extraBase));
                }
                changesInAlleles = true;
                theStart--;

                // 7. end if
            }

The theStart - 2 looks like it could suspiciously lead to a string index becoming -1 when theStart == 1, but this could also be a red herring since the conditional which triggers this block might not be satisfied by the given variant. I haven't been able to look too deeply at the code around here to understand what's actually happening.

This is not the moment the stacktrace is thrown of course, but I'm wondering if it's the place where that -1 appears and then propagates further to the moment later in tryToAddVariant where the string constructor is given index -1 to start which triggers the error. Otherwise I'm not sure how that number would ever become -1 yet.

I'll see what I can do. do you have a test-case that fails? I know you got a vcf from a user, but have you converted it to an additional unit test? I think that would be a good first step, so if you did it, please point me to it, if not, I can try my hand.

Thanks. Yes, I would take the files from the issue above (the JP ref + chain file linked from the public URL; should be small enough to download), and then take the bad deletion variant I wrote on chr17 and add an hg38 header to it. Any header should be fine, but didn't want to post a complete example since there's 3k contigs there. I haven't turned it into a formal Java test yet, but might be able to try that this week (using the example above, or rewriting one using whatever resource files we already have in the test suite).

This is resolved by #1933

I suppose it is resolved by #1956

@serge2016 I checked it works for the specific edge case variant I found before. Let us know if there are any other variants that give an issue if you try to rerun on it. I'm not sure when the next Picard release will be, but you can test building from master if you're interested in seeing if the new version of the tool can get through your entire file.