adamewing/bamsurgeon

Fewer variant reads than expected with specified VAF

aldosr opened this issue · 14 comments

Hi Adam,
I'm trying to insert a SNP into a BAM file (30,000X coverage), using the following input:

chr17 7674230 7674230 1 T

With REF=C and hg=hg38.
As you can see from the log, it picks up the correct input VAF and creates the correct number of mutated reads:

haplo_chr17_7674230_7674230 selected VAF: 1.000000
haplo_chr17_7674230_7674230 picked: 18975
haplo_chr17_7674230_7674230 wrote: 18976, mutated: 18975

However, it looks like only about half of the reads at the variant locus actually carry the variant (is it adding a heterozygous variant?):
[IGV screenshot]
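For what it's worth, the gap can be quantified directly from the log counts (a quick sketch; the ~50% alt fraction is eyeballed from the IGV view, not measured):

```python
# Counts copied from the addsnv.py log above.
wrote, mutated = 18976, 18975

expected_vaf = mutated / wrote   # ~1.0, matching "selected VAF: 1.000000"
observed_vaf = 0.5               # approximate alt fraction seen in IGV
shortfall = expected_vaf - observed_vaf   # roughly half the alt reads missing
```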

These are the input parameters:

addsnv.py -v input.txt \
        -f "${BAM}" \
        -r "${FASTA}" \
        -o output.bam \
        --picardjar "${PICARDJAR}" \
        --aligner mem \
        --alignopts c:250,M:,t:${SLURM_JOB_CPUS_PER_NODE},v:1 \
        -p "${SLURM_JOB_CPUS_PER_NODE}" \
        --mindepth 8 \
        --maxdepth 30000 \
        --minmutreads 4 \
        --seed 1234567890

This looks like targeted or amplicon sequencing data, is that correct? If so, do you know what protocol?

It's an artificial dataset created with NEAT and based on regions of a targeted kit.

Hi Adam,
I am running into a similar problem: far fewer mutated reads show up in the output BAM than are reported in the output VCF and the logs. I haven't found a good way to test this other than loading the BAM into IGV, or running samtools mpileup on every position in the VCF file and verifying that alts are actually present. In my case, over half are ref-only pileups. The coverage is the same as in the original input BAM, so it looks like the re-mapping or merging of the reads doesn't produce the expected result. Please advise.

I tested this behavior some more, and it seems to be prevalent around neighboring SNVs, i.e. cases where the input .bed file has entries like
1 11182158 11182158 0.22
1 11182179 11182179 0.05

Even when running with --haplosize 300 --force --insane, these two variants make it into the output VCF file, but the final BAM is missing one of them. Here are the output VCF entries and the pileups at the two positions:

1 11182158 . A C 100 PASS SOMATIC;VAF=0.20588235294117646;DPR=33.5 GT 0/1
1 11182179 . G A 100 PASS SOMATIC;VAF=0.08823529411764706;DPR=34.0 GT 0/1

1 11182158 N 34 a$aAAcAaAaaaAAAAaaaAAaAaAAAAaAAaAAA ABGGEFCFCDCDEDEDDC>CCCECCDDFCCGDE?
1 11182179 N 34 GggaGGGGgggGGgGgAGGGGGgGGGggGggGGa BEEEJFJFBEC@ICHEGHGGFFFFFFH<BGGFEC

Notice how the first locus has only one C alt, even though it was targeted at VAF = 0.22 and reported at VAF ≈ 0.21.
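Counting alts in an mpileup bases column can be scripted rather than eyeballed. Here's a minimal sketch of a counter that handles the `^X` (read start + mapq) and `$` (read end) markers; indel notation (`+N.../-N...`) is omitted for brevity:

```python
def count_alt(bases, alt):
    """Count occurrences of `alt` (either case) in an mpileup bases column."""
    out = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == '^':     # read start: skip the marker and the mapq character
            i += 2
            continue
        if c == '$':     # read end marker carries no base
            i += 1
            continue
        out.append(c)
        i += 1
    return sum(1 for c in out if c.upper() == alt.upper())

# The first pileup above: one C alt out of 34 reads, i.e. observed VAF ~0.03
print(count_alt("a$aAAcAaAaaaAAAAaaaAAaAaAAAAaAAaAAA", "C"))  # 1
```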

So I went on to look at the intermediate bams, namely the predecessor .muts.bam, and saw that the depth of coverage at those positions is almost double that of the original:

1 11182158 N 60 a$aAAcAaAAcacaaaAAAAAAAAaaaaaaAAAAaaAAaaCACAAACAaAACAaaAAAAAA ABGGEFCFFCCDDCCDDEEDDEEDDDCDC>C>CCCCCEECCCCDDDDFCCCCGGDDEE??
1 11182179 N 60 GGgggggaGGGGGGGGggggggGGGGggGGggGAGGGGGGGGGGggGGGGGGggGggGGa BBEEEEEEJJFFJJFFBBECEC@I@ICCHHEEGGHHGGGGFFFFFFFFFFFFH<BGGFEC

I only see this depth explosion around loci with neighboring SNVs, which leads me to suspect that the reads are gathered redundantly (later on I verified this). So perhaps, during the merging of the .muts.bam into the final BAM, this redundancy is not handled correctly, which causes some (or all) mutated copies to fall out?
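The depth doubling is consistent with per-locus gathering: reads spanning both nearby loci get collected once per locus. A toy model (read names invented, not bamsurgeon's actual code):

```python
# Reads covering two loci 21 bp apart; r3/r4 span both.
reads_near_11182158 = ["r1", "r2", "r3", "r4"]
reads_near_11182179 = ["r3", "r4", "r5", "r6"]

gathered = reads_near_11182158 + reads_near_11182179  # redundant gathering
depth_naive = len(gathered)        # 8: inflated, approaching double coverage
depth_unique = len(set(gathered))  # 6: true depth after de-duplicating by name
```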

To follow up, there also seems to be a non-deterministic component at play here. Running the same command twice produced different mutants, both at a much lower AF than reported in the VCF. Here's the diff of the pileups at the same two positions I explored above.


1       11182158        N       34      a$aAAtAaAaaaAAAAaaaAAaAaAAAAaAAaAAA   | 1       11182158        N       34      a$aAAaAaAaaaAAAAaaaAAaAaAAAAaAAaAAA
1       11182179        N       34      GgggGGGGgggGGgGgGGAGGAgGGGggGgaGGg    | 1       11182179        N       34      GgggGGGTgttGGgGgGGGGGGgGGGggGggGGg

As you can see, the depth is the same on the right and the left but different reads are mutated.

Hi there, sorry I haven't been able to get to this quickly. What kind of data are you using, e.g. WGS or targeted sequencing? What are the command-line arguments? Generally the best way for me to troubleshoot this sort of issue is to have a small .bam file and mutation inputs sufficient to replicate the problem; would you be able to provide that? Finally, you can make mutations deterministic by setting a seed (--seed from memory, but see the -h output if that's not right).

I should have noted - please only send data if it's based on a public data set (i.e. already openly downloadable somewhere) and is not based on patient data.

If that's all good then yes, you could drop it here. I would set a window a few times larger than your library size around the mutation, i.e. samtools view -b chrN:NNNN-NNNN > test.bam, remove unpaired reads (you can do this with bamsurgeon/scripts/remove_unpaired.py), and test that you can reproduce the issue on this tiny .bam file. If that works, zip it up with the input file and upload it here. I'm not sure about the file-size limit, but unless your .bam is super-deep targeted seq or something it should be OK.

I could reproduce the problem, but had to include multiple target loci. It appears that the sort order plays into it somehow, because if I minimize the set to only two neighboring loci, the output is correct. Attached is the data set along with the intermediate and the final output BAMs. Here's the command I used. The three flags at the end were my attempt to relax any depth-ratio restrictions, but they really didn't make a difference in this case.

addsnv.py -v 16_loci.bed -f NA12878.1:11169375-11227530.bam -r /reference/env/dev/content/hs37d5/indexes/bwa/hs37d5.fa -o NA12878.1:11169375-11227530.bam.post-surgery.20230419_182008.bam -p 8 --tagreads --insane --minmutreads 0 --aligner mem --alignerthreads 8 --picardjar /home/ggoltsman/utils/bamSurgeon/picard-tools-1.131/picard.jar --force --insane --haplosize 300

Also, if you notice, I'm using our indexed human build 37 as the reference. I didn't include it in the tarball as it's pretty bulky, but you can probably use any hs37 build you can get your hands on. Please let me know if you need anything else.

NA12878.1_11169375-11227530.bamSurgeon_issue_184.tar.gz

p.s. As to why I'm using --force, that's the only option I found that results in both neighboring loci 1:11182158 and 1:11182179 getting accepted (i.e. both end up in the vcf file).

Hi Adam,
Another issue I've discovered, which may or may not be related to this topic but affects the "visibility" of injected mutant reads, is that after re-mapping the SAM flags are sometimes not set properly. For example, here the read pair not only had its mapping quality slashed (that part is fine), but the bitwise flag (second column) lost the 'properly paired' bit.

```
$ ~/bin/samtools view my_starting.bam 15:22490135-22490135 | grep A00780:906:HLW7WDSX5:1:2323:12084:18818 | cut -f1,2,3,4,5
A00780:906:HLW7WDSX5:1:2323:12084:18818 99  15 22489996 60
A00780:906:HLW7WDSX5:1:2323:12084:18818 147 15 22490003 60

$ ~/bin/samtools view addsnv.9b14cb49-a7ba-449c-8f1e-5c4091f8a631.muts.bam 15:22490135-22490135 | grep A00780:906:HLW7WDSX5:1:2323:12084:18818 | cut -f1,2,3,4,5
A00780:906:HLW7WDSX5:1:2323:12084:18818 97  15 22489996 32
A00780:906:HLW7WDSX5:1:2323:12084:18818 145 15 22490003 32
```
This causes the reads to be rejected by downstream tools, e.g. mpileup, that rely on the SAM flags. You can see that the mapped positions didn't change, so the reads should still be deemed "properly paired". Do you think this is purely a bwa artifact, or is it somehow related to the way addsnv.py partitions the mutant reads?
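For reference, the two flag changes above (99 → 97 and 147 → 145) differ only in the 0x2 bit, which is exactly the proper-pair bit defined in the SAM specification. A small sketch to check that:

```python
# SAM flag bits relevant to the change above (per the SAM specification).
PAIRED      = 0x1
PROPER_PAIR = 0x2   # "read mapped in proper pair" -- the bit that was lost

def lost_proper_pair(flag_before, flag_after):
    """True if the proper-pair bit was set before re-mapping but not after."""
    return bool(flag_before & PROPER_PAIR) and not (flag_after & PROPER_PAIR)

# 99 ^ 97 == 0x2 and 147 ^ 145 == 0x2: only the proper-pair status changed.
print(lost_proper_pair(99, 97), lost_proper_pair(147, 145))
```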

Hi again, the flags all come from bwa. I suspect it's something related to a mismatch between the tool/parameters used to align the reads initially vs. what bamsurgeon did (the current defaults are painfully out of date). I haven't forgotten about the previous query either; I still haven't gotten to it, so apologies for that. I'll look into the specific reasons it's failing, but in general, if you have to use --force, chances are the resulting mutations won't have been done properly.

Before I forget, I just wanted to comment that I found a bug in Bamsurgeon that might account for this behavior.

The issue occurs in replace_reads.py when loading reads from the "donorbam" (the aligned merged BAM created from reads with mutations added).

When populating the rdict (see here), the read IDs are used as keys. However, because mutations can be introduced into the same read at different places, the same read ID can appear more than once in the donorbam. When a read ID that was already added to the rdict is processed again, the new read replaces the previous one. This causes some fraction of reads with added mutations to be "lost", dropping the relevant allele frequencies below the set value.
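The overwrite is easy to reproduce in miniature (read IDs and positions invented; this only models the dict behavior, not replace_reads.py itself):

```python
# A read spanning two nearby loci appears twice in the donorbam stream.
donor_reads = [
    ("readA", "copy mutated at 11182158"),
    ("readB", "copy mutated at 11182158"),
    ("readA", "copy mutated at 11182179"),  # same read ID, second mutation
]

rdict = {}
for read_id, read in donor_reads:
    rdict[read_id] = read   # the second readA silently replaces the first

# Only 2 of the 3 mutated copies survive; one alt allele is lost.
print(len(rdict))
```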

Without a more in-depth review of the code, I can't think of a quick fix for this issue. At first, I thought that you could add a function to "merge reads" whenever the same read ID shows up in the donorbam, but after merging the reads you'd need to do another round of realignment to fix the CIGAR strings which would be a pain. I think that the fix would need to be made back when the mutations are first introduced into the reads.

This might affect other open issues like #174.