quinlan-lab/STRling

Homopolymer false positives

hdashnow opened this issue · 11 comments

I think we are incorrectly calling homopolyer expansions in regions that are rich in a particular base, and/or where there is an insertion that is rich in a particular base.

All of these are aligned to hg38.

Illumina reads:
/scratch/ucgd/lustre-work/quinlan/u6018199/chaisson_2019/ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/PUR/HG00733/high_cov_alignment/HG00733.alt_bwamem_GRCh38DH.20150715.PUR.high_coverage.cram

STRling calls:
/uufs/chpc.utah.edu/common/HIPAA/u6026198/storage/git/STRling/working/chaisson_2019/HG00733.alt_bwamem_GRCh38DH.20150715.PUR.high_coverage-unplaced.txt

PacBio assemblies:
/scratch/ucgd/lustre-work/quinlan/u6018199/chaisson_2019/pacbio_local_assemblies/HG00733.*.bam

False positive STRling calls:

chr1:875844-876015 C

chr1:1350049-1350049 G

chr1:3167585-3167954 C
(This is the locus in the image)

homopolymerFP

I try:

samtools view -T /data/human/Homo_sapiens_assembly38.fasta ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/hgsv_sv_discovery/data/PUR/HG00733/high_cov_alignment/HG00733.alt_bwamem_GRCh38DH.20150715.PUR.high_coverage.cram -b -o x.bam chr1:2067585-4267954
samtools index x.bam
./src/strling extract -f /data/human/Homo_sapiens_assembly38.fasta x.bam x.bin -g /data/human/Homo_sapiens_assembly38.fasta.strling
./src/strling call -f /data/human/Homo_sapiens_assembly38.fasta -o x x.bam x.bin
grep 316 x-genotype.txt

and I see no calls for that region.

this is attempting to find chr1:3167585-3167954

here is an example read that gets called as homopolymer:

GGGGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAATGCAACAGGGGCGGTGGGAGGCAGGGGGGGGGTGGGAGGAGGGCAGGAGCAGGGGGGCAATGTCAC

that has a CIGAR of 57M69S

How about a requirement that repeat units be (mostly) consecutive?

In [1]: s = "GGGGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAATGCAACAGGGGCGGTGGGAGGCAGGGGGGGGGTGGGAGGAGGGCAGGAGCAGGGGGGCAATGTCAC"                            

In [2]: s[:57]                                                                                                                                                          
Out[2]: 'GGGGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG'

so it's ~100% repeat unit. it's just that there's a ~57 base stretch of G there in the reference.

So we need to count the repeat units in the reference and the reads

here is the pair of reads:

ERR895347.233154401	177	chr1	3168351	60	126M	chr2	32916480	0	TTCCTGCCCCACCCCTGCCTGGTTTTCTGGAACCCCTGGAACCAGCATCACACAAGTGAGCAAGTGGCTCTTTCCAATCTTACAACCCCCGCGGTGGAAGGAACAGGTGGAAAGATTGTTGAGACT	7<<<00B<BB7BB<0BBBB0B07B<BBBBB<BBBBBBBBBBBBBB<B<BBBBBBB<BBBBBB<<BBBBBBBBBBBBBBBBBBBBBBBBB7B7B<BBBBBBBBBBBB<BBBBBBBBBB<BBBB<<BB	AS:i:125	MQ:i:0	XS:i:20	MD:Z:0C125	NM:i:1	RG:Z:ERR895347
ERR895347.233154401	113	chr2	32916480	0	57M69S	chr1	3168351	0	GGGGAAAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAAATGCAACAGGGGCGGTGGGAGGCAGGGGGGGGGTGGGAGGAGGGCAGGAGCAGGGGGGCAATGTCAC	''''''''''''''''''''''''''''''''''''''''''''''''''''B<B<0000070000000<0'000<B<BB<<0B<070B0B<<0B<B<000BBBBBBB07BBBB<BBBBBBB<77<	AS:i:51	MQ:i:60	SA:Z:chr1,3167612,+,52M74S,0,2;	XA:Z:chr2,-32916507,7S51M68S,0;chr2,-32916284,7S51M68S,0;chr2,-32916345,56M70S,2;chr2,-32916253,7S50M69S,0;chr2,-32916395,57M69S,3;	XS:i:51	MD:Z:4G0G51	NM:i:2	RG:Z:ERR895347

and here is an example pair for the region starting with "135":

ERR899726.29422672	161	chr1	1349848	60	126M	chr2	32916487	0	GACTTCCGAGGCGTCCCTTGAGGCCGCGGGGCGTCCGCGCTTACTGTTCCGGGGCTCGCGGGGCTGGAGGCCTTCCCGACCCGACCCCAGCAGGAAGCGCGGAGGCCACTAGAGGGCGGCCCAGGA	BB<BBBB7B<BB7<BBBBBBBBBBB7B7BBBB7BBB7B7BBBB<BBBBBB7BBBBBB7B7BB<<BBBBBBBBBBBBB7B<BB7B<BBB<BBBBBBBBB7B7BBBBBBB<B<B0BBBB7BBBB<<<'	AS:i:126	MQ:i:4	XS:i:0	MD:Z:126	NM:i:0	RG:Z:ERR899726
ERR899726.29422672	81	chr2	32916487	4	94M32S	chr1	1349848	0	GGCGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGGGGGGGGAGGGGGGGAGTGAGGAGGGGGCCTGGACGGGGCAGGAG	'''''''''''''BBBB0B<B<BBBB0<<<0<B7000700<0000B777770B<<BB0<B0B<<B7B<000770<0BB<0B<0000<<0<BBBBB00<BBBBB0BBBBBBBBBBB7BBBBBB<<B<	AS:i:83	MQ:i:60	SA:Z:chr1,1351095,-,90S36M,0,0;	XA:Z:chr2,-32916353,96M30S,4;chr2,-32916264,83M43S,3;	XS:i:80	MD:Z:2G2G82G5	NM:i:3	RG:Z:ERR899726

in both of these cases, we are moving the 2nd read as it has low-quality, based on the first read, which has higher quality, but, it seems we are still considering the cigar string. if we move a poorly mapped read, whose mate is mapped well, we should set the cigar string of the poorly mapped read to all matches. I don't think we do this.

I think we can accommodate this in adjust_by

ok. so in adjust_by, given a pair of reads A and B if B is mapped well and A is repeat and/or not mapped well, then we adjust A to position of B, and we null out any clipping that A had. However, we do not adjust the align_length. So, in the first example above, the cigar is 57M69S, but then that read is moved. The align_length remains as 57 but that no longer makes sense. I think if we move a read, we should require that > 80% (or whatever that parameter is) of the entire read is repetitive.

AFAICT, this is addressed in 06be16d
but that requires further testing