Only one value for multicopy STR markers reported
Closed this issue · 2 comments
I followed the instructions on the website for Genotyping Y-STRs and CODIS markers in ordered to get STR values of HG01515 from the 1000 Genomes Project: ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/data/HG01515/alignment
I used the HG01515.mapped.ILLUMINA.bwa.IBS.low_coverage.20101123.bam
file and ran the following commands:
lobSTR --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ -f HG01515.mapped.ILLUMINA.bwa.IBS.low_coverage.20101123.bam --bam --noweb --rg-sample HG01515 --rg-lib HG01515 --fft-window-size 24 --fft-window-step 12 --out HG01515
samtools sort HG01515.aligned.bam HG01515.sorted
samtools index HG01515.sorted.bam
allelotype --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ --strinfo hg19_v3.0.2/lobstr_v3.0.2_hg19_strinfo.tab --command classify --noise_model /usr/local/share/share/lobSTR/models/illumina_v3.pcrfree --bam HG01515.sorted.bam --min-border 5 --min-bp-before-indel 7 --maximal-end-match 15 --noweb --min-read-end-match 5 --out HG01515
intersectBed -a HG01515.vcf -b lobSTR_ystr_hg19.bed -wa -wb | cut -f 1,2,10,14- | sed 's/:/\t/g' | cut -f 1,2,4,7,11- | sed 's/\//\t/' | cut -f 4 --complement | awk '{print $0 "\t" $6+$4/$5}
and I got the following results:
chrY 3131152 0|1 0 4 12 DYS393 12
chrY 3279702 4|2 4 4 9 DYS640 10
chrY 3679660 0|1 0 4 10 DYS572 10
chrY 6911569 0|1 0 4 11 DYS455 11
chrY 7053359 4|1 4 4 16 DYS576 17
chrY 8126300 0|1 0 5 8 DYS450 8
chrY 8224156 0|2 0 4 11 DYS454 11
chrY 8426378 -3|1 -3 3 22 DYS481 21
chrY 8466195 0|1 0 4 11 DYS531 11
chrY 8555980 0|2 0 5 8 DYS590 8
chrY 14466994 -4|1 -4 4 16 DYS437 15
chrY 16134296 0|1 0 4 10 DYS641 10
chrY 16508484 0|1 0 3 8 DYS472 8
chrY 17426012 -5|3 -5 5 11 DYS643 10
chrY 18393226 0|1 0 4 12 DYS533 12
chrY 18718879 -4|2 -4 4 15 GATA-A10 14
chrY 19134850 0|2 0 3 12 DYS426 12
chrY 19622111 -8|1 -8 2 23 YCAIIa/b 19
chrY 20203508 6|2 6 3 10 DYS425 12
chrY 20440393 3|4 3 3 15 DYS395S1a/b 16
chrY 20842518 0|1 0 4 14 DYS385a/b 14
chrY 21050690 -4|1 -4 4 12 DYS461 11
chrY 21050842 4|1 4 4 10 DYS460 11
chrY 21317047 0|1 0 4 11 DYS462 11
chrY 21520224 -4|1 -4 4 13 DYS549 12
chrY 21656837 0|1 0 5 10 DYS594 10
chrY 22092602 0|1 0 4 12 DYS445 12
chrY 22562564 0|2 0 4 9 DYS578 9
chrY 22633873 0|2 0 3 13 DYS392 13
chrY 22634857 0|1 0 4 12 DYS636 12
chrY 23843595 0|1 0 4 10 DYS406S1 10
chrY 24365178 0|1 0 6 8 DYS448.2 8
chrY 26078851 0|3 0 4 10 DYS459a/b 10
chrY 27087611 0|1 0 4 15 DYS464a/b/c/d 15
As you can see, multicopy markers have only one value. I ran the same commands (I have a wrapper script) on a FTDNA Big Y .bam file and got only one value on multicopy markers too. Is there a problem with the documentation, with lobSTR or maybe I am doing something wrong?
Hello,
As mentioned on in the tutorial, markers with multiple copies end up requiring manual inspection for now. This involves looking at how many reads support each allele (column 3 in the example you posted above). Especially in low coverage data such as 1000 genomes, it could just be that you only covered one allele at these markers. If you see evidence for only one allele (e.g. YCAIIa/b=19), you cannot rule out that there is another allele present, it might just not have been covered.
Thanks!