some sequences are missing in pyfastx.Fasta object

Question

dawnmy opened this issue 3 years ago · 4 comments

I loaded a FASTA file containing 4542 sequences with an average length of 2.5 kb; however, only 4539 sequences were in the pyfastx.Fasta object.

import pyfastx
fa = pyfastx.Fasta('assembly.fasta')
fa['contig_4540']  # raises KeyError

Besides, I could access a sequence, e.g. fa['contig_999'], the first time, but when I tried to access it again I got a KeyError.

I am using pyfastx 0.8.4 with Python 3.7.
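One way to narrow this down might be to count the header lines in the file independently of pyfastx and compare that number with len(fa). The sketch below is only a diagnostic aid: summarize_headers is a hypothetical helper, not part of pyfastx, and the idea that duplicate or empty names could account for the missing entries is an assumption, not something confirmed in this thread.

```python
from collections import Counter


def summarize_headers(lines):
    """Count FASTA records and collect names that occur more than once.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Hypothetical helper for diagnosis only; it does not use pyfastx.
    """
    names = Counter()
    for line in lines:
        if line.startswith(">"):
            # Take the first whitespace-delimited token after '>' as the name.
            fields = line[1:].split()
            names[fields[0] if fields else ""] += 1
    total = sum(names.values())
    duplicates = sorted(name for name, count in names.items() if count > 1)
    return total, duplicates


# Possible usage (assumes the file from the report above):
# with open("assembly.fasta") as handle:
#     total, duplicates = summarize_headers(handle)
# print(total, duplicates)
```

If the total matches the expected 4542 and no duplicates are reported, the file itself is probably fine and the discrepancy more likely lies in the index.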

Answer 1 · 2022-03-15T13:07:38.000Z

Thank you for reporting this issue. I will check that. A new version will be released soon.

Answer 2 · 2023-08-31T09:16:27.000Z

Any updates on this? I'm getting the same error. I'm loading a large FASTA file (~59M entries), and for some entries, whether accessed by string key or by integer index, I get a "key does not exist" error. Reloading the file fixes the problem for those keys but shifts it to others.
I'm using pyfastx 1.1.0.

Answer 3 · 2023-08-31T09:21:17.000Z

Thanks. Could you provide your code and a link to your data?

Answer 4 · 2023-08-31T11:18:51.000Z

I'm using the unzipped version of this file https://stringdb-downloads.org/download/protein.sequences.v12.0.fa.gz.
As for my code, the simple snippet below does not seem to reproduce this error:

import pyfastx
from tqdm import tqdm

FILEPATH = "/dccstor/bmfmbio/datasets/STRING/all/protein.sequences.v12.0.fa"
loaded_fasta = pyfastx.Fasta(FILEPATH)
# Access the first 5e7 records by integer index; no error is raised here.
for idx in tqdm(range(int(5e7))):
    a = loaded_fasta[idx]

Maybe it has to do with multiple workers accessing the same fasta file? I'm afraid I cannot post the actual code I'm using at this point.
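If shared access across workers is indeed the cause (an untested assumption at this point), one workaround to try would be giving each worker process its own pyfastx.Fasta object instead of reusing one created in the parent. A minimal sketch follows; get_handle and fetch_sequence are hypothetical helper names, and only pyfastx.Fasta(path), indexing, and the .seq attribute are taken from pyfastx itself.

```python
import os

# (process id, path) -> handle. Keying on the pid means a forked worker
# never reuses a handle that was opened by its parent process.
_HANDLES = {}


def get_handle(path, opener):
    """Return a handle for `path` that was opened in the current process."""
    key = (os.getpid(), path)
    if key not in _HANDLES:
        _HANDLES[key] = opener(path)
    return _HANDLES[key]


def fetch_sequence(path, key):
    """Look up one record by name or integer index via a per-process handle."""
    import pyfastx  # imported lazily so get_handle has no hard dependency

    fasta = get_handle(path, pyfastx.Fasta)
    return fasta[key].seq
```

Opening the file once in the parent before starting any workers, so the index already exists on disk, may also help avoid several processes trying to build it at the same time.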