mgymrek/lobstr-code

Only one value for multicopy STR markers reported

Closed this issue · 2 comments

I followed the instructions on the website for Genotyping Y-STRs and CODIS markers in ordered to get STR values of HG01515 from the 1000 Genomes Project: ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/data/HG01515/alignment I used the HG01515.mapped.ILLUMINA.bwa.IBS.low_coverage.20101123.bam file and ran the following commands:

lobSTR --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ -f HG01515.mapped.ILLUMINA.bwa.IBS.low_coverage.20101123.bam --bam --noweb --rg-sample HG01515 --rg-lib HG01515 --fft-window-size 24 --fft-window-step 12 --out HG01515

samtools sort HG01515.aligned.bam HG01515.sorted

samtools index HG01515.sorted.bam

allelotype --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ --strinfo hg19_v3.0.2/lobstr_v3.0.2_hg19_strinfo.tab --command classify --noise_model /usr/local/share/share/lobSTR/models/illumina_v3.pcrfree --bam HG01515.sorted.bam --min-border 5 --min-bp-before-indel 7 --maximal-end-match 15 --noweb --min-read-end-match 5 --out HG01515

intersectBed -a HG01515.vcf -b lobSTR_ystr_hg19.bed -wa -wb | cut -f 1,2,10,14- | sed 's/:/\t/g' | cut -f 1,2,4,7,11- | sed 's/\//\t/' | cut -f 4 --complement | awk '{print $0 "\t" $6+$4/$5}

and I got the following results:

chrY    3131152 0|1 0   4   12  DYS393  12
chrY    3279702 4|2 4   4   9   DYS640  10
chrY    3679660 0|1 0   4   10  DYS572  10
chrY    6911569 0|1 0   4   11  DYS455  11
chrY    7053359 4|1 4   4   16  DYS576  17
chrY    8126300 0|1 0   5   8   DYS450  8
chrY    8224156 0|2 0   4   11  DYS454  11
chrY    8426378 -3|1    -3  3   22  DYS481  21
chrY    8466195 0|1 0   4   11  DYS531  11
chrY    8555980 0|2 0   5   8   DYS590  8
chrY    14466994    -4|1    -4  4   16  DYS437  15
chrY    16134296    0|1 0   4   10  DYS641  10
chrY    16508484    0|1 0   3   8   DYS472  8
chrY    17426012    -5|3    -5  5   11  DYS643  10
chrY    18393226    0|1 0   4   12  DYS533  12
chrY    18718879    -4|2    -4  4   15  GATA-A10    14
chrY    19134850    0|2 0   3   12  DYS426  12
chrY    19622111    -8|1    -8  2   23  YCAIIa/b    19
chrY    20203508    6|2 6   3   10  DYS425  12
chrY    20440393    3|4 3   3   15  DYS395S1a/b 16
chrY    20842518    0|1 0   4   14  DYS385a/b   14
chrY    21050690    -4|1    -4  4   12  DYS461  11
chrY    21050842    4|1 4   4   10  DYS460  11
chrY    21317047    0|1 0   4   11  DYS462  11
chrY    21520224    -4|1    -4  4   13  DYS549  12
chrY    21656837    0|1 0   5   10  DYS594  10
chrY    22092602    0|1 0   4   12  DYS445  12
chrY    22562564    0|2 0   4   9   DYS578  9
chrY    22633873    0|2 0   3   13  DYS392  13
chrY    22634857    0|1 0   4   12  DYS636  12
chrY    23843595    0|1 0   4   10  DYS406S1    10
chrY    24365178    0|1 0   6   8   DYS448.2    8
chrY    26078851    0|3 0   4   10  DYS459a/b   10
chrY    27087611    0|1 0   4   15  DYS464a/b/c/d   15

As you can see, multicopy markers have only one value. I ran the same commands (I have a wrapper script) on a FTDNA Big Y .bam file and got only one value on multicopy markers too. Is there a problem with the documentation, with lobSTR or maybe I am doing something wrong?

Hello,
As mentioned on in the tutorial, markers with multiple copies end up requiring manual inspection for now. This involves looking at how many reads support each allele (column 3 in the example you posted above). Especially in low coverage data such as 1000 genomes, it could just be that you only covered one allele at these markers. If you see evidence for only one allele (e.g. YCAIIa/b=19), you cannot rule out that there is another allele present, it might just not have been covered.

Thanks!