Parsing of BAM header when file is expected to be readname-sorted.
agilly opened this issue · 7 comments
Hello,
I'm trying to run qc3C bam
and as the doc specifies, I sorted my input bam by name:
samtools sort -t 32 -n -o $libname.aligned.sorted.bam $libname.aligned.bam
Then I run:
qc3C bam -k arima --fasta hg38.p13.fa.gz --bam $libname.aligned.sorted.bam --output-path $outpath -t 10
That produces an error:
[...]
INFO | 2022-01-24 04:33:24,898 | qc3C.bam_based | Found 1,237,255,262 alignments to analyse
[E::idx_find_and_load] Could not retrieve index file for 'masked_filename.aligned.sorted.bam'
ERROR | 2022-01-24 04:33:24,904 | main | masked_filename.aligned.sorted.bam must be sorted by name
This is strange because it seems something in the code is looking for a bam index, which is not possible to generate for a read-name-sorted BAM.
I am running into the same error using the conda version of qc3C
qc3C bam --enzyme DpnII --fasta /home/jon/Working_Files/s_chloronotus/hapo-g/round_3/hapog_results/hapog.fasta --bam /home/jon/Working_Files/s_chloronotus/bwa-mem/aligned.SRR8499559.bam --max-obs 100000 --output-path /home/jon/Working_Files/s_chloronotus/qc3c/SRR8499559_w_bam/
INFO | 2022-01-28 11:20:27,247 | qc3C.ligation | Loading cached cut-site database
INFO | 2022-01-28 11:20:27,422 | qc3C.bam_based | Accepting all usable reads
INFO | 2022-01-28 11:20:27,423 | qc3C.utils | Random seed was not set, using 1383103
INFO | 2022-01-28 11:20:27,423 | qc3C.bam_based | Beginning analysis...
Pairs: 0%| | 0/100000 [00:00<?, ?it/s][E::idx_find_and_load] Could not retrieve index file for '/home/jon/Working_Files/s_chloronotus/bwa-mem/aligned.SRR8499559.bam'
ERROR | 2022-01-28 11:20:27,429 | main | 'HD'
Pairs: 0%|
Hi @agilly,
There appears to be something strange happening here. One expected and one unexpected.
First issue
The warning emitted by HTSlib about a missing index is known and spurious when it comes to query-name sorted files. There is yet to be a fix in PySAM to address how the underlying HTSlib API is called. See pysam-developers/pysam#939 for more detail.
I may suppress this message, as I had been hoping PySAM would be modified by now to avoid raising it.
Second issue
qc3C makes an explicit test for the sorting type. This is done by inspecting the 'HD' record within BAM header, and it would appear that your HD record does not specify query-name sorting.
Could you please provide me with that string?
ie.
samtools view -H {your bamfile} | grep '^@HD'
An example of the header string is as follows.
@HD VN:1.6 SO:queryname
I suspect your header does not contain an 'SO' tag-field or the value associated deviates from 'queryname'.
In either case, it will help to see what you have and -- importantly -- if you could also describe to me the workflow which creates the BAM that goes into qc3C.
Hi Matthew,
thanks for your reply. Indeed I can confirm that my bam file does not contain the expected header:
$> samtools view -H L79851_Track-123454.aligned.sorted.bam| grep '^@HD'
@HD VN:1.6 SO:unknown
The bam files were generated as follows:
samtools sort -n [aligned.bam] -o [aligned.sorted.bam]
However, the logs for that particular step are missing so I will regenerate just to be sure.
UPDATE: Running the above command gives me :
@HD VN:1.6 SO:queryname
Going back to my command, I used -t 32
to determine threads, but that was actually a mistake. From the help:
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
I confirmed that the first million read names in both files is the same, so perhaps -t 32
did not actually change the -n
sort, but it must have caused the SO
to turn to unknown
.
No worries. I always find it curious that samtools went with -@ n
for thread count. This is the only suite of programs I am aware of that has elected to use that character. Perhaps it was a brilliant choice, but being the odd-man-out can lead to issues just like you've experienced.
I am going to reopen this issue, with the intention of reporting the sort order if found as part of the error.