some sequences are missing in pyfastx.Fasta object
dawnmy opened this issue · 4 comments
I loaded a FASTA file containing 4542 sequences with an average length of 2.5 kb; however, only 4539 sequences were in the pyfastx.Fasta object.
import pyfastx
fa = pyfastx.Fasta('assembly.fasta')
fa['contig_4540']  # raises KeyError
Besides, I could access a sequence, e.g. fa['contig_999'], the first time, but when I tried to access it again I got a KeyError.
The version of pyfastx I used is 0.8.4, with Python 3.7.
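To pin down which records are missing, one option is to read the header lines directly from the file and compare them with the names the index knows about. The sketch below uses only the standard library; `header_names` and `missing_from_index` are hypothetical helper names, and it assumes the identifier is the text between `>` and the first whitespace.

```python
def header_names(path):
    """Collect the identifier of every '>' header line in a FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the identifier ends at the first whitespace.
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def missing_from_index(path, indexed_names):
    """Return header identifiers found in the file but not in indexed_names."""
    indexed = set(indexed_names)
    return [name for name in header_names(path) if name not in indexed]
```

With pyfastx, the second argument could be built as `[seq.name for seq in fa]`, assuming iteration over the indexed Fasta object yields records with a `name` attribute. An empty result would suggest the index is complete and the problem lies in the lookup rather than in index construction.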
Thank you for reporting this issue. I will check that. A new version will be released soon.
Any updates on this? I'm getting the same error: I'm loading a large FASTA file (~59M entries), and for some of the entries (whether accessed by string key or by integer index), I'm getting a "key does not exist" error. Reloading the file solves the problem for the given keys, but shifts it to others.
I'm using pyfastx 1.1.0
Thanks. Could you provide your code and an HTTPS link to your data?
I'm using the unzipped version of this file https://stringdb-downloads.org/download/protein.sequences.v12.0.fa.gz.
As for my code, the simple snippet below does not seem to reproduce this error:
import pyfastx
from tqdm import tqdm

FILEPATH = "/dccstor/bmfmbio/datasets/STRING/all/protein.sequences.v12.0.fa"
loaded_fasta = pyfastx.Fasta(FILEPATH)
# Access the first 5e7 records by integer index, one at a time.
for idx in tqdm(range(int(5e7))):
    a = loaded_fasta[idx]
Maybe it has to do with multiple workers accessing the same fasta file? I'm afraid I cannot post the actual code I'm using at this point.