Pyfastx.read.raw mangling reads
Closed this issue · 8 comments
When iterating through a gzipped fastq file and dumping reads to various output filehandles using read.raw from records stored in a dict reads dumped to files are sometimes mangled. It seems that the way the raw string is generated is flawed in some way leading to weirdness like the below, this occurs predictably and the mangling can be changed by removing a read from the start of the file.
I managed to get around this by not indexing the fastq files in question and reconstructing the record from the name, seq, qual tuple.
An example of a mangled read record:
@SRR5382385.27877 27877 length=135
ATCCAACCGAGCAAGCCGTTCCCGGCTGGTACTCAAATGTAAAAATTTTTATCTGCCGTCTTGAAAATTTTTCCGGCGAACATAAATTTTTATCACGCGACGCCGAATGCGGATCGCGCACTTAATCACTTGGAT
+SRR5382385.27877 27877 length=135
AAF<FFFAFFFJFJFJJJ<7FFF<<<JJF<JJJJFJJFJJFJJJJJJJJJJ<FJFJ--7AF<F<F-FJ-JJJFF<A<-77JFJJF-FJJJJJFA<JFFAAA-<7A-FFAFFJ--7A7FFFJJJFJ77FA--FAFJ
@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTTGCCCGCGAA@SRR5382385.27886 27886 length=135
AGTTTT@SRR5382385.27887 27887 length=135
ACTTTGCGCTCACCGAAAAACGCCGCCTGCAATGCCGGAATACCCAATTGCCGATAGGGGCTGATATTACGAAATTCGCGGGCAAAACTCGGCTCGCCAAACGGGCGGTGCATGGTGCGCGTCAGGTAGTTGCTC
+SRR5382385.27887 27887 length=135
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ-FJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJ-7FFJJJJJJJJJJJJJJJJJJJ
A script which can be used to reproduce this issue is available here happy to provide any more information if needed to trace the problem.
Thank you for reporting this issue. I will check and fix it.
I have download the SRR5382385 to test. But I could not reproduce the issue. Could you provide a simple script for me to reproduce this issue. Thanks.
Sure no problem, I successfully replicated the issue with this script:
import pyfastx
import sys
input_fastq = pyfastx.Fastq(sys.argv[1])
for read in input_fastq:
sys.stdout.write(read.raw)
Using version 2.0.2 installed from pip on an M2 arm64 macbook with a 697M gzipped fastq, whether or not the fastq is compressed seems to have no impact leading to identical mangling. I can provide the specific fastq if desired but I get the same issue with any large fastq.
Thanks! I have no macbook with M2 arm64 to test. But i will try to do it. That may take a long time. You can try to use python setup.py install
to install pyfastx from source code.
I just fully reproduced the problem in a fresh environment on an x86_64 cluster like:
mamba env create -n pyfastx_test pip seqtk
wget {fastq_file}
seqtk size {fastq_file}
Returns 761975 379155685
pip install pyfastx
python pyfastx_minimal_example.py {fastq_file} > test.fastq
seqtk size test.fastq
Returns 646 378608
With the 647th read looking like (quality line for first read merged with read id line from next read):
@18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d runid=ca30319f25b536efe90b79a721af63532e657a6c read=4884 ch=56 start_time=2023-09-22T09:48:04.386162+00:00 flow_cell_id=FAW72829
ATGTCCTGTTCTTGGTTCAGTTATGCGACCAAGCCATATTCACGGGAAAGTGGCCTACCGTGACTCGATTCCGTTTGTAGTCGTCTGTCGTTTTTCGTGCGCCGCTTCAACTATGTCCCCCGCTCACGCGTCACCGACTGCCAGCGACGGCCGGGTATCGGCCCGACGCTCCAGCGCCATCCATTTTCAGGGCTAGTTGATCCGGCAGGTGAGTTGTTACACACTCCTTAGCGGATTCCGACTTCCACGGCCACCGTCCTGCTGTCTATATCAACCAACACCTTTTCTGGGGTCTGATGAGCGTCGGCATCGGGCGCCTAACCCGGCGTTCGGTTCATCCCGCAGCGCCAGTTCTCTACCAAAAGTGGCCCGCTTCGGCACTCGCATTCCACGCCCGGCTCCACGCCAGCGAGCCGGGCTTCTTACCCATTTAAAGTTTGAGAATAGGTTGAGATCCCCATGTTGAAGACGGCGCACGAAAACAACAACGACTACAACAGAAATGAGTCAC
+
))**'#$%%&%%$#$%(''''''&''&$####$%&%%&**&''$%%%&'()'(++..166?ABBB><<=@GIEGFGQIFKKIQHHKJC1011/,5@18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d runid=ca30319f25b536efe90b79a721af63532e657a6c read=4884 ch=56 start_time=2023-09-22T09:48:04.386162+00:00 flow_cell_id=FAW72829 protocol_group_id=Flying_Sharks_Metagenomics_20230922 sample_id=Flying_Sharks_Metagenomics_20230922 barcode=barcode02 barcode_alias=barcode02 parent_read_id=18569ca3-3f8b-4fff-b68b-9b2ab5e81f7d basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_@45a218bd-8cb7-4368-b5d9-a4574d98ab7e runid=ca30319f25b536efe90b79a721af63532e657a6c read=867 ch=237 start_time=2023-09-22T09:14:55.386162+00:00 flow_cell_id=FAW72829 protocol_group_id=Flying_Sharks_Metagenomics_20230922 sample_id=Flying_Sharks_Metagenomics_20230922 barcode=barcode02 barcode_alias=barcode02 parent_read_id=45a218bd-8cb7-4368-b5d9-a4574d98ab7e basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0 barcode=barcode02
ATGTTTTGCATCTTACTTCGTTCAGTTGTGAACGAACCGATATTAGCCGCCTGTCCTACCGTGACTCGATTCCGTTTGTAGTCGTCTGTCGTTTTTCGTGCGCCGCTTCAACATGGGCGCCGTCGGCGGCCGGGCCGAGGTGGGATCCCGAGGCCTCTCCAGTCCGCCGAGGGCGCACCACCGGCCCGTCTCGCCCGCCGCGCCGGGGAGGTGGAGCACGAGCGCACGTGTTAGGACCCGAAAGATGGTGAACTATGCCTGGGCAGGGCGAAGCCAGAGGAAACTCTGGTGGAGGTCCGTAGCGGTCCTGACGTGCAAATCGGTCGTCCGACCTGGGTATAGGGGCGAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGCGCTCTCGCACGACCACCCCCGCCGTTTATCCGGTAAAGCGAATGATTAGAGGTCTTGGGGCCGAAAACGATCTCAACCTATTCTCAAACTTTAAATGGGTAAGAAGCCCGGCTCGCTCGCGTGGAGCCGGGCGTGGAGGTGGATGCCTAGTGGGCCACTTTTGGTAAGCAGAACTGGCGCTGCGGGATGAACCGAACGCCAGGGTACGGCGCCGATGCCGACGCTCATAGGACCCCCAGAAAAGGTGTTGTTGATATAGACAGCAGGACGGTGGCCATGGAAGTCGGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATCAACTAGCCCTGAAAATGGATGGCGCTGGAGCGTCGGGCCCATACCCGGCCGTCGCTGGCAGTCGGTGACGCGTGAGAGGGGGACGGGTATGGGCCCGACGCTCCAGCGCCATCCCCTTTCAGGGCTGATTGATACGGCAGGTGAGTCTACACACTCCTTAGCGGATTCCGACTTCCATGGCCACCGTCCTGCTGTCTATATCAACCAACACCTTTTCTGGTCGTGAGCGTCAGCCATGCGGGCGCCTTAACCCGGCGTTGGTCCGATCCCGCAGCGCCGGTCTGCTTACCGAAAGTGGCCCGCTAGGCACTCGCATTCCCACACCCTCCGCGCCGGCCGAGAGCCGCTCTTACCCATTTAAAGTTTGAGAATAGGTTGAGATCGTTTCGGCCCCAAGACCTCTAATAGTTCGCTTTACCGGATAAAACTGCGGGGGGTGGTCGTGCGAGAGCGCAGCTAGCCTGAGGGAGACCAGAGGGGACCGTACTAGATAGTTCGATTAGTCTTTCGCCCCTACCCAGGTCGGACGACCGAGTTTTAGCATCAGTACGGACCTCCACCAGAGCGTCCTCGCGGCTTCGCCCTGCCCGGGCGTAGCCCAGCCGTCTTTCGGGTCCTAACACGTGCGCTCGTGCTGCCTCCCGGCGCGGCGCGAGACGGCGGGTCGTGCCTCGGCGGACTGGGAGCCGGGACCCCCGCCGCCGACAGCACCATGTTGAAGCGGCGCACGAAAAACGACAGACGACTACAAACGGAATGGAGTCG
(pyfastx) jovyan:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: GenuineIntel
Model name: Intel Xeon Processor (Skylake, IBRS)
CPU family: 6
Model: 85
Thread(s) per core: 2
I'm also interested in this fix
Fixed in v2.1.0