`FastaReader` returns empty pandas dataframe
abearab opened this issue · 8 comments
I'm trying to load this FASTA data using the FastaReader
module, but it seems something is wrong. Any thoughts?
Python implementation: CPython
Python version : 3.9.18
IPython version : 8.18.1
Compiler : GCC 11.2.0
OS : Linux
Release : 5.10.0-28-cloud-amd64
Machine : x86_64
Processor :
CPU cores : 16
Architecture: 64bit
matplotlib: 3.6.3
screenpro : 0.2.6
anndata : 0.10.5.post1
biobear : 0.14.6
sys : 3.9.18 (main, Sep 11 2023, 13:41:44)
[GCC 11.2.0]
pandas : 1.5.3
polars : 0.20.10
Thanks for filing this, and sorry for the issue! I think the problem is due to the .fa
extension. In the short term, could you rename the file to use .fasta
? I can make this easier to configure in the medium term. As you probably know, FASTA has a variety of extensions compared to something like BAM or VCF.
In [1]: import biobear as bb
In [2]: !wget -O CRISPRi_v2_human.trim_1_29_forward.fasta https://raw.githubusercontent.com/mhorlbeck/ScreenProcessing/master/library_reference/CRISPRi_v2_human.trim_1_29_forward.fa
--2024-02-22 17:06:25-- https://raw.githubusercontent.com/mhorlbeck/ScreenProcessing/master/library_reference/CRISPRi_v2_human.trim_1_29_forward.fa
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11799370 (11M) [text/plain]
Saving to: ‘CRISPRi_v2_human.trim_1_29_forward.fasta’
CRISPRi_v2_human.trim_1_29_forward.fasta 100%[====================================================================================================================================>] 11.25M 16.2MB/s in 0.7s
2024-02-22 17:06:27 (16.2 MB/s) - ‘CRISPRi_v2_human.trim_1_29_forward.fasta’ saved [11799370/11799370]
In [3]: df = bb.FastaReader('CRISPRi_v2_human.trim_1_29_forward.fasta').to_pandas()
In [4]: df
Out[4]:
id description sequence
0 A1BG_-_58858617.23-P1 None GAGACCCAGCGCTAACCAGGTTTAAGAG
1 A1BG_-_58858788.23-P1 None GGGCACCCAGGAGCGGTAGGTTTAAGAG
2 A1BG_+_58858964.23-P1 None CTCCGGGCGACGTGGAGTGGTTTAAGAG
3 A1BG_-_58858630.23-P1 None AACCAGGGGTGCCCAAGGGGTTTAAGAG
4 A1BG_+_58858549.23-P1 None GCGAGGAACCGCCCAGCAAGTTTAAGAG
... ... ... ...
209065 non-targeting_03785 None CTACTCCGCCCCGCGGGAGGTTTAAGAG
209066 non-targeting_03786 None TGGCCGTTCATGGGACCGGGTTTAAGAG
209067 non-targeting_03787 None AACTCTGTAGAAGGGACCGGTTTAAGAG
209068 non-targeting_03788 None CTCGACAGCGACTGAAGAGGTTTAAGAG
209069 non-targeting_03789 None AGAAGCCAACCTCGCGTTAGTTTAAGAG
[209070 rows x 3 columns]
I see, that makes sense. It works with .fasta
, thanks for clarifying. One more question: are you using FASTA indexing (e.g. pyfaidx) to accelerate downstream tasks?
Cool, glad that part works. BioBear only supports indexed BAM/VCF/GFF files at the moment (https://www.wheretrue.dev/docs/exon/sql-reference#indexed-scan-functions) ... I've been meaning to add indexed fastas and the underlying bits are in place, so I'll add it today and follow up on this ticket.
Here's an example of using a faidx index with a FASTA file... it's a bit limited in that it only supports a single region. Is that similar to how you're using (py)faidx, or is it with multiple regions and/or a regions file?
The builds for this to go to pypi are on their way, and should be up in ~2hrs... https://github.com/wheretrue/biobear/actions/runs/8051376387
biobear/python/tests/test_session.py
Lines 77 to 88 in 7357fb8
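Roughly, a single-region query looks something like the sketch below (the scan function name and region syntax here are approximate, the linked test shows the exact usage):

import biobear as bb

session = bb.connect()

# Approximate sketch: the fasta_indexed_scan name and the region string are
# assumptions on my part; see the test referenced above for the exact call.
df = session.sql(
    """
    SELECT id, sequence
    FROM fasta_indexed_scan('./CRISPRi_v2_human.trim_1_29_forward.fasta', 'A1BG_-_58858617.23-P1')
    """
).to_polars()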
Cool, let me give it a try and I'll get back to you soon. Thanks for your quick responses!
Here's an example of using a faidx index with a FASTA file... it's a bit limited in that it only supports a single region. Is that similar to how you're using (py)faidx, or is it with multiple regions and/or a regions file?
What I'm trying to do is simply (!) load FASTQ data as a dataframe and then use the "group_by" function from polars to count all unique sequences. Then I want to map the counted sequences to a reference fasta
file (or just a table of expected sequences and their names). Does that make sense? This is my current code to load FASTQ, count, and then map ...
I also need to ask your opinion about multi-threading on FASTQ files to run these kinds of tasks with safe memory usage (I can open a new issue to discuss that separately).
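For reference, a minimal sketch of that kind of pipeline (not the actual notebook code), assuming biobear's fastq_scan and FastaReader.to_polars() plus polars group_by; the FASTQ file name and the slice coordinates are placeholders:

import biobear as bb
import polars as pl

session = bb.connect()

# Load the FASTQ as a polars dataframe (placeholder file name).
reads = session.sql(
    "SELECT sequence FROM fastq_scan('./sample_R1.fastq.gz')"
).to_polars()

# Count unique sliced sequences; the slice length (29) is a placeholder.
counts = (
    reads
    .with_columns(pl.col("sequence").str.slice(0, 29).alias("protospacer"))
    .group_by("protospacer")
    .agg(pl.count().alias("count"))
)

# Map the counts onto the reference library loaded with FastaReader.
library = bb.FastaReader("CRISPRi_v2_human.trim_1_29_forward.fasta").to_polars()
mapped = library.join(counts, left_on="sequence", right_on="protospacer", how="left")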
Hey @tshauck, I'm not sure I understand your fasta indexing implementation. How should I create a fasta
index using biobear
? Also, I'm not sure what function I need to call to use it.
Thanks for sharing the notebook. I think I generally understand the linked part, and it looks interesting overall. This comment got a little long, but it's mostly code snippets... there's a paragraph on reproducing parts of your notebook with SQL, one on preprocessing to parquet to get good I/O performance, and one responding to your last comment.
SQL
Regarding the workflow, it's not super clear to me whether slice_seq
is static across the entire dataframe or different for each row, but either way I think you could push some of this down to SQL, which would likely give you better performance. E.g. if it's static:
df = session.sql(
    """
    SELECT substr(f.sequence, 1, 10) AS sequence, COUNT(*) AS count
    FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
    GROUP BY substr(f.sequence, 1, 10)
    """
).to_polars()
Which would do the counting and reading at the same time.
Or if you have the start and stop from another place (like a CSV), you could do something like:
session = bb.connect()

session.execute(
    """
    CREATE EXTERNAL TABLE positions
    STORED AS CSV
    WITH HEADER ROW
    LOCATION './examples/positions.csv'
    """
)

df = session.sql(
    """
    SELECT sequence, COUNT(*) AS count
    FROM (
        SELECT substr(f.sequence, positions.start, positions.stop - positions.start) AS sequence
        FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
        JOIN positions
            ON f.name = positions.name
    )
    GROUP BY sequence
    """
).to_polars()
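For reference, the positions table here is just assumed to be a CSV keyed by read name with start and stop columns, something like (made-up values):

name,start,stop
read_0001,1,30
read_0002,1,30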
I realize SQL may not be totally in your wheelhouse, but just to put it out there since it can actually be pretty good for stuff like this.
On a FASTQ file with about 2.5M records, this takes about 26 seconds of wall time (roughly 48 seconds of CPU time), with the vast majority of the time spent on I/O:
In [3]: %%time
...: df = session.sql(
...: """
...: SELECT sequence, COUNT(*) AS count
...: FROM (
...: SELECT substr(f.sequence, positions.start, positions.stop - positions.start) AS sequence
...: FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
...: JOIN positions
...: ON f.name = positions.name
...: )
...: GROUP BY sequence
...: """
...: ).to_polars()
...:
...:
CPU times: user 47.7 s, sys: 972 ms, total: 48.6 s
Wall time: 26.5 s
Depending on how library
is stored in map_sample_counts_to_library
, you might also move that "down" into the SQL layer.
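For example, if the library were a CSV of expected sequences and names (the file path and column names below are assumptions), the mapping could be a join in the same query:

session.execute(
    """
    CREATE EXTERNAL TABLE library
    STORED AS CSV
    WITH HEADER ROW
    LOCATION './library.csv'
    """
)

df = session.sql(
    """
    -- hypothetical library.csv with "name" and "sequence" columns
    SELECT library.name, counts.sequence, counts.count
    FROM (
        SELECT substr(f.sequence, 1, 10) AS sequence, COUNT(*) AS count
        FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
        GROUP BY substr(f.sequence, 1, 10)
    ) counts
    LEFT JOIN library
        ON counts.sequence = library.sequence
    """
).to_polars()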
Parquet
On the last point, because I/O takes up the majority of the time, if you write the file to parquet format first, you get much better I/O for subsequent reads, and polars works really well with Parquet. E.g....
In [11]: %%time
...: session.execute(
...: """
...: COPY
...: (SELECT * FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz'))
...: TO './sequences.parquet'
...: (FORMAT PARQUET)
...: """
...: )
...:
...:
CPU times: user 28.1 s, sys: 889 ms, total: 28.9 s
Wall time: 23.8 s
In [12]: %%time
...: pl.read_parquet("./sequences.parquet")
...:
...:
CPU times: user 1.16 s, sys: 407 ms, total: 1.57 s
Wall time: 1.01 s
Out[12]:
shape: (2_385_245, 4)
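And then the counting step can run lazily against the parquet file with polars (the slice coordinates are placeholders again):

import polars as pl

# Lazily scan the parquet written above and count unique sliced sequences.
counts = (
    pl.scan_parquet("./sequences.parquet")
    .with_columns(pl.col("sequence").str.slice(0, 10).alias("protospacer"))
    .group_by("protospacer")
    .agg(pl.count().alias("count"))
    .collect()
)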
pyfaidx
My fault for not understanding your workflow. For now, you'd have to use pyfaidx (or samtools) to create the index, but biobear will recognize it and use it on the underlying files to control which bytes actually get read.
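In case it helps, creating the index with pyfaidx is a one-liner (samtools faidx produces the same .fai file):

from pyfaidx import Faidx

# Writes CRISPRi_v2_human.trim_1_29_forward.fasta.fai alongside the FASTA,
# which biobear can then pick up for indexed reads.
Faidx("CRISPRi_v2_human.trim_1_29_forward.fasta")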
Anyways, thanks for raising the issues. Let me know if you have thoughts and I'll check the other issue you raised tomorrow.