Failure to properly parse ENCODE's GRCh38 reference
sjgosai opened this issue · 4 comments
Hello,
I wanted to use pyfastx for some ENCODE analysis and tried to use the Fasta class on the GRCh38 reference found here (more info here). When I iterate through the fa object I get unexpected results (e.g. the "description" from chr1 is the header for chr2). Interestingly enough, opening the file as a Fastx object seems to work as expected. Unfortunately, I really want to use the fetch method, which isn't implemented for Fastx.seq objects.
To troubleshoot, I went ahead and tested the hg38.fa.gz FASTA from UCSC found here. Unfortunately, I'm getting odd results with this file too. For example, many of the description fields are short nucleotide sequences (e.g. for chr10 I get "CGGGC").
The description field errors might not have been an issue on their own, except that when you hit the end of the FASTA file you get a UnicodeDecodeError when the description field is accessed.
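For anyone who wants to check what each record's header should look like independently of pyfastx, here is a minimal sketch that reads the header lines with the standard library only. The helper name read_fasta_headers is hypothetical, and it assumes that the "description" is the full header line without the leading ">" and that the name is its first whitespace-delimited token.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta_headers(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, description) for each record in a plain or gzipped FASTA file.

    Hypothetical helper for cross-checking; not part of pyfastx.
    """
    # Detect gzip by its two magic bytes rather than trusting the extension.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    opener = gzip.open if is_gzip else open

    # errors="replace" keeps the scan going even if a header has odd bytes.
    with opener(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: description = header line minus the leading ">".
                description = line[1:].rstrip("\r\n")
                tokens = description.split(None, 1)
                # Assumption: name = first whitespace-delimited token.
                name = tokens[0] if tokens else ""
                yield name, description
```

The resulting pairs can be compared against the name and description reported while iterating over the Fasta object; a description such as "CGGGC" would not match any header line returned here.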
I am using version 0.8.4.
Here is a Google Colab notebook replicating the issue.
Did you install pyfastx with pip?
Yes. The first step is to install from pip.
After that, I've fully reproduced the issue in the Colab notebook linked here. The notebook stands alone: it installs pyfastx and downloads the relevant data from ENCODE or UCSC. The notebook is roughly annotated, which hopefully helps clarify the issue.
We have fixed this issue in new versions >= 0.9.0