Failure to properly parse ENCODE's GRCh38 reference

Question

sjgosai opened this issue 3 years ago · 4 comments

Hello,

I wanted to use pyfastx for some ENCODE analysis and tried the Fasta class on the GRCh38 reference found here (more info here). When I iterate over the fa object I get unexpected results (e.g. the "description" for chr1 is the header for chr2). Interestingly, opening the file as a Fastx object seems to work as expected. Unfortunately, I really want to use the fetch method, which isn't implemented for Fastx.seq objects.

To troubleshoot, I tested the hg38.fa.gz FASTA from UCSC found here. Unfortunately, I'm getting odd results with this file too. For example, many of the description fields are short nucleotide sequences (e.g. for chr10 I get "CGGGC").

The description field errors might not have been a problem on their own, except that when you reach the end of the FASTA file you get a UnicodeDecodeError when the description field is accessed.
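For anyone who wants to check for the shifted descriptions without opening the notebook, here is a rough sketch of one way to do it: read the header lines straight from the file with the standard library and compare them with what the parser reports. The helper names `read_fasta_headers` and `find_mismatches` are made up for this illustration and are not part of pyfastx. The commented-out usage assumes that sequences yielded by `pyfastx.Fasta` expose `name` and `description` attributes, and the file name in it is a placeholder.

```python
import gzip


def read_fasta_headers(path):
    """Return (name, description) pairs taken directly from the '>' lines.

    Hypothetical helper for cross-checking a parser; not a pyfastx function.
    """
    opener = gzip.open if path.endswith(".gz") else open
    headers = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                description = line[1:].rstrip("\r\n")
                parts = description.split(None, 1)
                name = parts[0] if parts else ""
                headers.append((name, description))
    return headers


def find_mismatches(expected, observed):
    """Return the (expected, observed) pairs that differ, in file order."""
    return [(e, o) for e, o in zip(expected, observed) if e != o]


# Usage sketch (assumes pyfastx is installed; the path is a placeholder):
#
#   import pyfastx
#   path = "reference.fa.gz"
#   expected = read_fasta_headers(path)
#   observed = [(s.name, s.description) for s in pyfastx.Fasta(path)]
#   for e, o in find_mismatches(expected, observed):
#       print("expected", e, "but parser reported", o)
```

An empty list from `find_mismatches` would mean the parser's names and descriptions agree with the raw headers. Anything else shows which records are off.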

I am using version 0.8.4.

Here is a Google Colab notebook replicating the issue.

Answer 1 · 2022-01-21T13:21:43.000Z

Did you install pyfastx with pip?

Answer 2 · 2022-01-21T19:27:14.000Z

Yes. The first step in the notebook is to install it with pip.

After that, I've fully reproduced the issue in the Colab notebook linked here. The notebook stands alone. It installs pyfastx and downloads the relevant data from ENCODE or UCSC. The notebook is roughly annotated, hopefully helping clarify the issue.

Answer 3 · 2023-01-01T14:05:16.000Z

We have fixed this issue in versions >= 0.9.0.

Answer 4 · 2023-01-01T14:05:24.000Z

We have fixed this issue in versions >= 0.9.0.
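To confirm that an environment has a release with the fix, a small check like the following might help. It is only a sketch: `parse_release`, `has_fix`, and `installed_version` are illustrative names rather than pyfastx functions, and the comparison only looks at the leading numeric `major.minor.patch` part of the version string.

```python
import re
from importlib import metadata


def parse_release(version):
    """Turn '0.8.4' (or '0.9.0rc1') into a (major, minor, patch) tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(group or 0) for group in match.groups())


def has_fix(installed, minimum="0.9.0"):
    """True if the installed version is at least the release named in the thread."""
    return parse_release(installed) >= parse_release(minimum)


def installed_version(package="pyfastx"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pyfastx is not installed")
    elif has_fix(found):
        print(f"pyfastx {found} should include the fix")
    else:
        print(f"pyfastx {found} predates 0.9.0; consider upgrading")
```

If the check reports an older version, upgrading with pip (`pip install --upgrade pyfastx`) should pull in a fixed release.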