wheretrue/biobear

`FastaReader` returns empty pandas dataframe

abearab opened this issue · 8 comments

I'm trying to load this FASTA data using the FastaReader module, but it seems something is wrong! Any thoughts?

[screenshot: FastaReader call on the .fa file returning an empty dataframe]


Python implementation: CPython
Python version       : 3.9.18
IPython version      : 8.18.1

Compiler    : GCC 11.2.0
OS          : Linux
Release     : 5.10.0-28-cloud-amd64
Machine     : x86_64
Processor   : 
CPU cores   : 16
Architecture: 64bit

matplotlib: 3.6.3
screenpro : 0.2.6
anndata   : 0.10.5.post1
biobear   : 0.14.6
sys       : 3.9.18 (main, Sep 11 2023, 13:41:44) 
[GCC 11.2.0]
pandas    : 1.5.3
polars    : 0.20.10

Thanks for filing this, and sorry for the issue! I think the problem is due to the .fa extension. In the short term, could you rename the file to use .fasta? I can make this easier to configure in the medium term. As you probably know, FASTA has a variety of extensions compared to something like BAM or VCF.

In [1]: import biobear as bb

In [2]: !wget -O CRISPRi_v2_human.trim_1_29_forward.fasta https://raw.githubusercontent.com/mhorlbeck/ScreenProcessing/master/library_reference/CRISPRi_v2_human.trim_1_29_forward.fa
--2024-02-22 17:06:25--  https://raw.githubusercontent.com/mhorlbeck/ScreenProcessing/master/library_reference/CRISPRi_v2_human.trim_1_29_forward.fa
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11799370 (11M) [text/plain]
Saving to: ‘CRISPRi_v2_human.trim_1_29_forward.fasta’

CRISPRi_v2_human.trim_1_29_forward.fasta                  100%[====================================================================================================================================>]  11.25M  16.2MB/s    in 0.7s

2024-02-22 17:06:27 (16.2 MB/s) - ‘CRISPRi_v2_human.trim_1_29_forward.fasta’ saved [11799370/11799370]


In [3]: df = bb.FastaReader('CRISPRi_v2_human.trim_1_29_forward.fasta').to_pandas()

In [4]: df
Out[4]:
                           id description                      sequence
0       A1BG_-_58858617.23-P1        None  GAGACCCAGCGCTAACCAGGTTTAAGAG
1       A1BG_-_58858788.23-P1        None  GGGCACCCAGGAGCGGTAGGTTTAAGAG
2       A1BG_+_58858964.23-P1        None  CTCCGGGCGACGTGGAGTGGTTTAAGAG
3       A1BG_-_58858630.23-P1        None  AACCAGGGGTGCCCAAGGGGTTTAAGAG
4       A1BG_+_58858549.23-P1        None  GCGAGGAACCGCCCAGCAAGTTTAAGAG
...                       ...         ...                           ...
209065    non-targeting_03785        None  CTACTCCGCCCCGCGGGAGGTTTAAGAG
209066    non-targeting_03786        None  TGGCCGTTCATGGGACCGGGTTTAAGAG
209067    non-targeting_03787        None  AACTCTGTAGAAGGGACCGGTTTAAGAG
209068    non-targeting_03788        None  CTCGACAGCGACTGAAGAGGTTTAAGAG
209069    non-targeting_03789        None  AGAAGCCAACCTCGCGTTAGTTTAAGAG

[209070 rows x 3 columns]

I see, that makes sense. It works with .fasta, thanks for clarifying. One more question: are you using FASTA indexing (e.g. pyfaidx) to accelerate downstream tasks?

Cool, glad that part works. BioBear only supports indexed BAM/VCF/GFF files at the moment (https://www.wheretrue.dev/docs/exon/sql-reference#indexed-scan-functions)... I've been meaning to add indexed FASTAs, and the underlying bits are in place, so I'll add it today and follow up on this ticket.

Here's an example of using a faidx index with a FASTA file... it's a bit limited in that it only supports a single region. Is that similar to how you're using (py)faidx, or is it with multiple regions and/or a regions file?

The builds for this to go to pypi are on their way, and should be up in ~2hrs... https://github.com/wheretrue/biobear/actions/runs/8051376387

from pathlib import Path

from biobear import connect

DATA = Path("data")  # assumed: directory containing test.fasta and its .fai index


def test_indexed_scan():
    """Test that an indexed FASTA can be queried by region."""
    session = connect()
    fasta_file = DATA / "test.fasta"
    query = f"SELECT * FROM fasta_indexed_scan('{fasta_file}', 'a:1-2')"
    results = session.sql(query)
    df = results.to_polars()
    assert len(df) == 1, df

Cool, let me give it a try and I'll get back to you soon. Thanks for your quick responses!

> Here's an example of using a faidx index with a FASTA file... it's a bit limited in that it only supports a single region. Is that similar to how you're using (py)faidx, or is it with multiple regions and/or a regions file?

What I'm trying to do is simply (!) load FASTQ data as a dataframe and then use the "group_by" function from polars to count all unique sequences. Then I want to map the counted sequences to a reference FASTA file (or just a table of expected sequences and their names). Does that make sense? This is my current code to load FASTQ, count, and then map ...

https://github.com/ArcInstitute/ScreenPro2/blob/c3e476f1a75edb8d39ba466078e1204d652e9db0/screenpro/ngs.py#L27C1-L91C21
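In short, the idea is something like this (a simplified sketch, not the exact code in the link; the file names and the 28 bp trim length are placeholders):

import biobear as bb
import polars as pl

session = bb.connect()

# Read the FASTQ sequences into a polars dataframe.
reads = session.sql(
    "SELECT sequence FROM fastq_scan('./reads.fastq.gz')"
).to_polars()

# Count unique sequences after trimming to the protospacer length (28 bp here).
counts = (
    reads
    .with_columns(pl.col("sequence").str.slice(0, 28).alias("trimmed"))
    .group_by("trimmed")
    .agg(pl.len().alias("count"))
)

# Map the counts back to the library loaded via FastaReader, as shown above.
library = pl.from_pandas(
    bb.FastaReader("CRISPRi_v2_human.trim_1_29_forward.fasta").to_pandas()
)
mapped = counts.join(library, left_on="trimmed", right_on="sequence", how="left")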

I also need to ask your opinion about multi-threading on FASTQ files to run these kinds of tasks with safe memory usage (I can open a new issue to discuss that separately).

Hey @tshauck, I'm not sure I understand your FASTA indexing implementation. How should I create a FASTA index using biobear? Also, I'm not sure which function I need to call to use it.

Thanks for sharing the notebook. I think I generally understand the linked part, and it looks interesting overall. This comment got a little long, but it's mostly code snippets... there's a paragraph on reproducing parts of your notebook with SQL, one on preprocessing to Parquet to get good I/O performance, and one responding to your last comment.

SQL

Regarding the workflow, it's not entirely clear to me whether slice_seq is static across the entire dataframe or different for each row, but either way I think you could push some of this down to SQL, which would likely give you better performance. E.g., if it's static:

import biobear as bb

session = bb.connect()

df = session.sql(
    """
SELECT substr(f.sequence, 1, 10) AS sequence, COUNT(*) AS count
FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
GROUP BY substr(f.sequence, 1, 10)
"""
).to_polars()

Which would do the counting and reading at the same time.

Or if you have the start and stop from another place (like a CSV), you could do something like:

session = bb.connect()

session.execute(
    """
CREATE EXTERNAL TABLE positions
STORED AS CSV
WITH HEADER ROW
LOCATION './examples/positions.csv'
"""
)

df = session.sql(
    """
SELECT sequence, COUNT(*) AS count
FROM (
    SELECT substr(f.sequence, positions.start, positions.stop - positions.start) AS sequence
    FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
        JOIN positions
            ON f.name = positions.name
)
GROUP BY sequence
"""
).to_polars()

I realize SQL may not be totally in your wheelhouse, but I wanted to put it out there since it can actually be pretty good for tasks like this.

On a FASTQ file with about 2.5M records, this takes about 45 seconds of CPU time (roughly 26 seconds wall clock), with the vast majority spent on I/O:

In [3]: %%time
   ...: df = session.sql(
   ...:     """
   ...: SELECT sequence, COUNT(*) AS count
   ...: FROM (
   ...:     SELECT substr(f.sequence, positions.start, positions.stop - positions.start) AS sequence
   ...:     FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
   ...:         JOIN positions
   ...:             ON f.name = positions.name
   ...: )
   ...: GROUP BY sequence
   ...: """
   ...: ).to_polars()
   ...: 
   ...: 
CPU times: user 47.7 s, sys: 972 ms, total: 48.6 s
Wall time: 26.5 s

Depending on how `library` is stored in map_sample_counts_to_library, you might also be able to move that "down" to the SQL layer as well.
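For example, something along these lines could work (a sketch only: the fasta_scan table function and the 28-character trim are assumptions on my part, not something verified in this thread):

df = session.sql(
    """
SELECT lib.id, COUNT(*) AS count
FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz') f
    JOIN fasta_scan('CRISPRi_v2_human.trim_1_29_forward.fasta') lib
        ON substr(f.sequence, 1, 28) = lib.sequence
GROUP BY lib.id
"""
).to_polars()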

Parquet

On the last point: because I/O takes up the majority of the time, if you write the file to Parquet format first, you get much better I/O for subsequent reads, and polars works really well with Parquet. E.g....

In [11]: %%time
    ...: session.execute(
    ...:     """
    ...:     COPY
    ...:         (SELECT * FROM fastq_scan('./624-02_lib26670_nextseq_n0240_151bp_R1.fastq.gz'))
    ...:     TO './sequences.parquet'
    ...:     (FORMAT PARQUET)
    ...: """
    ...: )
    ...: 
    ...: 
CPU times: user 28.1 s, sys: 889 ms, total: 28.9 s
Wall time: 23.8 s

In [12]: %%time
    ...: pl.read_parquet("./sequences.parquet")
    ...: 
    ...: 
CPU times: user 1.16 s, sys: 407 ms, total: 1.57 s
Wall time: 1.01 s
Out[12]: 
shape: (2_385_245, 4)
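And once the reads are in Parquet, the counting step from the SQL example above can also run lazily off disk with polars (a sketch; the 10-character slice just mirrors the substr(f.sequence, 1, 10) above):

import polars as pl

counts = (
    pl.scan_parquet("./sequences.parquet")  # lazy: nothing is read yet
    .with_columns(pl.col("sequence").str.slice(0, 10).alias("prefix"))
    .group_by("prefix")
    .agg(pl.len().alias("count"))
    .collect()  # executes the plan with projection pushdown
)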

pyfaidx

My fault for not understanding your workflow. For now, you'd have to use pyfaidx (or samtools) to create the index, but biobear will recognize it and use it on the underlying files to control which bytes actually get read.
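End to end it would look roughly like this (a sketch using the library file from earlier in the thread; the region string is illustrative and follows the 'name:start-end' form from the test above):

import pyfaidx
import biobear as bb

# Instantiating pyfaidx.Fasta writes a .fai index next to the FASTA file
# (samtools faidx would produce the same index).
pyfaidx.Fasta("CRISPRi_v2_human.trim_1_29_forward.fasta")

session = bb.connect()
df = session.sql(
    """
SELECT *
FROM fasta_indexed_scan(
    'CRISPRi_v2_human.trim_1_29_forward.fasta',
    'A1BG_-_58858617.23-P1:1-28'
)
"""
).to_polars()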

Anyways, thanks for raising the issues. Let me know if you have thoughts and I'll check the other issue you raised tomorrow.

Hi @tshauck, thanks for all the feedback! I really liked using SQL for I/O tasks on FASTQ files. It seems to outperform the other methods I've tried so far! I don't even need the FASTA indexing for now. I'm closing this issue and will be in touch again, since I really like your tool!