Homopolymer false positives
hdashnow opened this issue · 11 comments
I think we are incorrectly calling homopolyer expansions in regions that are rich in a particular base, and/or where there is an insertion that is rich in a particular base.
All of these are aligned to hg38.
Illumina reads:
/scratch/ucgd/lustre-work/quinlan/u6018199/chaisson_2019/ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/PUR/HG00733/high_cov_alignment/HG00733.alt_bwamem_GRCh38DH.20150715.PUR.high_coverage.cram
STRling calls:
/uufs/chpc.utah.edu/common/HIPAA/u6026198/storage/git/STRling/working/chaisson_2019/HG00733.alt_bwamem_GRCh38DH.20150715.PUR.high_coverage-unplaced.txt
PacBio assemblies:
/scratch/ucgd/lustre-work/quinlan/u6018199/chaisson_2019/pacbio_local_assemblies/HG00733.*.bam
False positive STRling calls:
chr1:875844-876015 C
chr1:1350049-1350049 G
chr1:3167585-3167954 C
(This is the locus in the image)
I try:
samtools view -T /data/human/Homo_sapiens_assembly38.fasta ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/PUR/HG00733/high_cov_alignment/HG00733.alt_bwamem_GRCh38DH.20150715.PUR.high_coverage.cram -b -o x.bam chr1:2067585-4267954
samtools index x.bam
./src/strling extract -f /data/human/Homo_sapiens_assembly38.fasta x.bam x.bin -g /data/human/Homo_sapiens_assembly38.fasta.strling
./src/strling call -f /data/human/Homo_sapiens_assembly38.fasta -o x x.bam x.bin
grep 316 x-genotype.txt
and I see no calls for that region.
this is attempting to find chr1:3167585-3167954
here is an example read that gets called as homopolymer:
GGGGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAATGCAACAGGGGCGGTGGGAGGCAGGGGGGGGGTGGGAGGAGGGCAGGAGCAGGGGGGCAATGTCAC
that has a CIGAR of 57M69S
How about a requirement that repeat units be (mostly) consecutive?
In [1]: s = "GGGGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAATGCAACAGGGGCGGTGGGAGGCAGGGGGGGGGTGGGAGGAGGGCAGGAGCAGGGGGGCAATGTCAC"
In [2]: s[:57]
Out[2]: 'GGGGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG'
so it's ~100% repeat unit. it's just that there's a ~57 base stretch of G there in the reference.
So we need to count the repeat units in the reference and the reads
here is the pair of reads:
ERR895347.233154401 177 chr1 3168351 60 126M chr2 32916480 0 TTCCTGCCCCACCCCTGCCTGGTTTTCTGGAACCCCTGGAACCAGCATCACACAAGTGAGCAAGTGGCTCTTTCCAATCTTACAACCCCCGCGGTGGAAGGAACAGGTGGAAAGATTGTTGAGACT 7<<<00B<BB7BB<0BBBB0B07B<BBBBB<BBBBBBBBBBBBBB<B<BBBBBBB<BBBBBB<<BBBBBBBBBBBBBBBBBBBBBBBBB7B7B<BBBBBBBBBBBB<BBBBBBBBBB<BBBB<<BB AS:i:125 MQ:i:0 XS:i:20 MD:Z:0C125 NM:i:1 RG:Z:ERR895347
ERR895347.233154401 113 chr2 32916480 0 57M69S chr1 3168351 0 GGGGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAATGCAACAGGGGCGGTGGGAGGCAGGGGGGGGGTGGGAGGAGGGCAGGAGCAGGGGGGCAATGTCAC ''''''''''''''''''''''''''''''''''''''''''''''''''''B<B<0000070000000<0'000<B<BB<<0B<070B0B<<0B<B<000BBBBBBB07BBBB<BBBBBBB<77< AS:i:51 MQ:i:60 SA:Z:chr1,3167612,+,52M74S,0,2; XA:Z:chr2,-32916507,7S51M68S,0;chr2,-32916284,7S51M68S,0;chr2,-32916345,56M70S,2;chr2,-32916253,7S50M69S,0;chr2,-32916395,57M69S,3; XS:i:51 MD:Z:4G0G51 NM:i:2 RG:Z:ERR895347
and here is an example pair for the region starting with "135":
ERR899726.29422672 161 chr1 1349848 60 126M chr2 32916487 0 GACTTCCGAGGCGTCCCTTGAGGCCGCGGGGCGTCCGCGCTTACTGTTCCGGGGCTCGCGGGGCTGGAGGCCTTCCCGACCCGACCCCAGCAGGAAGCGCGGAGGCCACTAGAGGGCGGCCCAGGA BB<BBBB7B<BB7<BBBBBBBBBBB7B7BBBB7BBB7B7BBBB<BBBBBB7BBBBBB7B7BB<<BBBBBBBBBBBBB7B<BB7B<BBB<BBBBBBBBB7B7BBBBBBB<B<B0BBBB7BBBB<<<' AS:i:126 MQ:i:4 XS:i:0 MD:Z:126 NM:i:0 RG:Z:ERR899726
ERR899726.29422672 81 chr2 32916487 4 94M32S chr1 1349848 0 GGCGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAGGGGGGGAGTGAGGAGGGGGCCTGGACGGGGCAGGAG '''''''''''''BBBB0B<B<BBBB0<<<0<B7000700<0000B777770B<<BB0<B0B<<B7B<000770<0BB<0B<0000<<0<BBBBB00<BBBBB0BBBBBBBBBBB7BBBBBB<<B< AS:i:83 MQ:i:60 SA:Z:chr1,1351095,-,90S36M,0,0; XA:Z:chr2,-32916353,96M30S,4;chr2,-32916264,83M43S,3; XS:i:80 MD:Z:2G2G82G5 NM:i:3 RG:Z:ERR899726
in both of these cases, we are moving the 2nd read as it has low-quality, based on the first read, which has higher quality, but, it seems we are still considering the cigar string. if we move a poorly mapped read, whose mate is mapped well, we should set the cigar string of the poorly mapped read to all matches. I don't think we do this.
I think we can accommodate this in adjust_by
ok. so in adjust_by
, given a pair of reads A
and B
if B is mapped well and A is repeat and/or not mapped well, then we adjust A to position of B, and we null out any clipping that A had. However, we do not adjust the align_length
. So, in the first example above, the cigar is 57M69S
, but then that read is moved. The align_length remains as 57
but that no longer makes sense. I think if we move a read, we should require that > 80% (or whatever that parameter is) of the entire read is repetitive.