lmdu/pyfastx

Unicode error reading large fastq with index?

Closed this issue · 2 comments

I got a UnicodeDecodeError while iterating over a large FASTQ file with pyfastx 1.0.1. It seems to reproduce on any large FASTQ.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 0: invalid continuation byte

Steps to reproduce:

# Warning: this downloads a 15 GB FASTQ. Probably overkill, but it reproduces the error.
docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) ncbi/sra-tools fasterq-dump --progress SRR15035500

docker run --rm -it -v $(pwd):$(pwd) -w $(pwd) mambaorg/micromamba bash -c 'micromamba install -y pyfastx==1.0.1 -c conda-forge -c bioconda -c defaults && python -c "import pyfastx; [print(read) for read in pyfastx.Fastq(\"SRR15035500.fastq\")]"'
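
A quick way to rule out the input file itself, sketched under the assumption that the FASTQ written by fasterq-dump should be plain ASCII: scan it in binary mode for the first byte at or above 0x80. The helper name `first_non_ascii` is hypothetical and is not part of pyfastx. If it returns `None`, the 0xd4 byte in the traceback is not present anywhere in the file, which would suggest the problem lies in how the file is read rather than in the data.

```python
def first_non_ascii(path, chunk_size=1 << 20):
    """Return (offset, byte_value) of the first non-ASCII byte, or None.

    Hypothetical diagnostic helper; it only reads the raw file and does
    not use pyfastx at all.
    """
    offset = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return None  # reached end of file, everything was ASCII
            if not chunk.isascii():
                # Only walk the chunk byte by byte when it is known to
                # contain at least one byte >= 0x80.
                for i, value in enumerate(chunk):
                    if value >= 0x80:
                        return offset + i, value
            offset += len(chunk)

# Example call (file name taken from the reproduction steps):
# print(first_non_ascii("SRR15035500.fastq"))
```

Reading in 1 MiB chunks keeps memory use flat, which matters for a file of this size.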

Not sure if it's related to 6bfa15b / #39 / #56? Thank you as always for the great tool!

lmdu commented

Thank you! I have fixed it in version 1.1.0.
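
For anyone checking whether their environment already has the fix, here is a minimal sketch that compares the installed pyfastx version against 1.1.0 using only the standard library. The helpers `parse_release` and `has_fix` are hypothetical names, and the parsing is deliberately simple: it reads the leading digits of each dot-separated part and ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (1, 1, 0)  # release named in the comment above


def parse_release(text):
    """Turn a string such as '1.0.1' into a tuple such as (1, 0, 1)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '1.1' compares equal to '1.1.0'.
    while len(parts) < len(FIXED_IN):
        parts.append(0)
    return tuple(parts)


def has_fix(installed):
    """Return True if the given version string is at least 1.1.0."""
    return parse_release(installed) >= FIXED_IN


try:
    print("fix included:", has_fix(version("pyfastx")))
except PackageNotFoundError:
    print("pyfastx is not installed in this environment")
```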

Thank you! 🙏