deepomicslab/SpecHLA

ExtractHLAread.sh

vegetableyu opened this issue · 5 comments

Hi,
When I apply the ExtractHLAread.sh script, the unpaired reads produced are far more than the paired reads. After researching, I saw that some people say the bam file should be sorted by read name before converting to fastq, while in ExtractHLAread.sh, it is sorted by coordinates. However, when I switched to sorting by read name, although it reduced a lot of unpaired reads, running SpecHLA.sh afterwards generated a large number of “-” results. This is very confusing, do you know what might be the situation?

Hi,

  • The bam file should be sorted by coordinates using samtools sort, because ExtractHLAread.sh extracts HLA-related reads by coordinates. In converting bam to fastq, samtools fastq can handle such bam file.

As for the large number of unpaired reads, please check following two points:

  • please ensure the reference version (hg19 or hg38) is the same between generating bam and running ExtractHLAread.sh.
  • For the unpaired reads, please check the alignment situation in the bam, see what alleles the unextracted read end mapped to.

please let me know if it works.

Best,
shuai

Hi shuai,
I looked at samtools fastq --help, which also says it needs to be sorted by reads name. However, using samtools fastq for bam files sorted by reads name and coordinates yielded little difference in results. With bam bam2Fastq, however, bam files sorted by read name get 20 times as many paired reads as those sorted by coordinates.
samtools

Hi,
Apologies for any confusion caused. It seems that ExtractHLAread.sh utilizes bam bam2Fastq to obtain the fastq files. In order to investigate the discrepancy in read numbers resulting from bam bam2Fastq, I suggest examining the reads extracted from the "sorted by read name" but not from the "sorted by coordinates" approach. Please select a subset of these reads and verify the regions to which they are mapped by examining the BAM. If the reads belong to the HLA region, it is likely that the "sorted by read name" approach is correct; otherwise, the "sorted by coordinates" approach may be more appropriate.

By the way, could you please check the alignment detail of the reads ("unpaired reads produced are far more than the paired reads") in the original BAM.

Hi shuai,
Thanks for your reply and suggestions, i will check this later. I noticed that others were also discussing the issue of preprocessing and gave some suggestions. This might help:
Kingsford-Group/kourami#30
https://bitbucket.nygenome.org/projects/COMPBIO/repos/hla_prep/browse

Thanks for the information. According to the issue, the reason of extracting too much unpaired reads might be keeping multiple alignments other than only the primary alignment.