wheretrue/biobear

How to load pairs of FASTQ files from paired-end reads

abearab opened this issue · 7 comments

As you know, paired-end reads are very common, especially with Illumina NGS technologies. I was wondering what you'd recommend for processing paired-end FASTQ files in a single session. Looking forward to hearing your thoughts.

There are several common computational tasks with paired-end reads; here are some examples:

1. trimming reads

2. merging paired-end sequences

It might be a separate conversation, but it could be a good idea for your tool to include guidance for these tasks. Based on what I have learned about your tool so far, I assume they could be handled very efficiently by its SQL core and combined with other tasks at the same time.

Thanks as always for filing the issue. In the short term, please see the following example for loading paired-end reads. I also agree with your point about having some guidance around trimming, and filed a ticket to update the docs with an example.

>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.1.fastq
>>> !wget https://raw.githubusercontent.com/marcelm/cutadapt/main/tests/cut/paired.2.fastq
>>>
>>> import biobear as bb
>>> session = bb.connect()
>>>
>>> session.sql("""
   ...: WITH one AS (SELECT REPLACE(name, '/1', '') trimmed_name, * FROM fastq_scan('paired.1.fastq')),
   ...:      two AS (SELECT REPLACE(name, '/2', '') trimmed_name, * FROM fastq_scan('paired.2.fastq'))
   ...:      
   ...: SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
   ...: FROM one
   ...:   JOIN two
   ...:     ON one.trimmed_name = two.trimmed_name
   ...: 
   ...: """).to_polars()

Outside of Python, the SQL looks like the following. The idea is to create two CTEs that standardize the read names (removing /1 or /2); the body then joins the two files on that new name and returns a bit of info about both reads. You could add quality scores or the description if you want those.

WITH one AS (
    SELECT REPLACE(name, '/1', '') trimmed_name, *
    FROM fastq_scan('paired.1.fastq')
), two AS (
    SELECT REPLACE(name, '/2', '') trimmed_name, *
    FROM fastq_scan('paired.2.fastq')
)

SELECT one.trimmed_name, one.name one_name, one.sequence one_sequence, two.name two_name, two.sequence two_sequence
FROM one
  JOIN two
    ON one.trimmed_name = two.trimmed_name

The results look like:

shape: (4, 5)
┌──────────────┬──────────┬────────────────────────────────┬──────────┬────────────────────────────────┐
│ trimmed_name ┆ one_name ┆ one_sequence                   ┆ two_name ┆ two_sequence                   │
│ ---          ┆ ---      ┆ ---                            ┆ ---      ┆ ---                            │
│ str          ┆ str      ┆ str                            ┆ str      ┆ str                            │
╞══════════════╪══════════╪════════════════════════════════╪══════════╪════════════════════════════════╡
│ read5        ┆ read5    ┆ TTATTTGTCTCCAGC                ┆ read5    ┆ CAACAGGCCACATTAGACATATCGGATGGT │
│ read1        ┆ read1/1  ┆ TTATTTGTCTCCAGC                ┆ read1/2  ┆ GCTGGAGACAAATAA                │
│ read4        ┆ read4/1  ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ read4/2  ┆ CATCCCGTCAACATTCAAACGGCCTGTCCA │
│ read3        ┆ read3/1  ┆ CCAACTTGATATTAATAACA           ┆ read3/2  ┆ TGTTATTAATATCAAGTTGG           │
└──────────────┴──────────┴────────────────────────────────┴──────────┴────────────────────────────────┘

I'd also just add that the session will persist state within a given program or interpreter session, so you can create a table backed by a file and then query it multiple times if needed.

import biobear as bb

s = bb.connect()

s.execute("""
CREATE EXTERNAL TABLE test STORED AS FASTQ LOCATION 'paired.1.fastq'
""")

s.sql("SELECT * FROM test").to_polars()
┌─────────┬─────────────┬────────────────────────────────┬────────────────────────────────┐
│ name    ┆ description ┆ sequence                       ┆ quality_scores                 │
│ ---     ┆ ---         ┆ ---                            ┆ ---                            │
│ str     ┆ str         ┆ str                            ┆ str                            │
╞═════════╪═════════════╪════════════════════════════════╪════════════════════════════════╡
│ read1/1 ┆ some text   ┆ TTATTTGTCTCCAGC                ┆ ##HHHHHHHHHHHHH                │
│ read3/1 ┆ null        ┆ CCAACTTGATATTAATAACA           ┆ HHHHHHHHHHHHHHHHHHHH           │
│ read4/1 ┆ null        ┆ GACAGGCCGTTTGAATGTTGACGGGATGTT ┆ HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH │
│ read5   ┆ null        ┆ TTATTTGTCTCCAGC                ┆ #####HHHHHHHHHH                │
└─────────┴─────────────┴────────────────────────────────┴────────────────────────────────┘

Hi @tshauck – thank you so much for your response! I'll look more carefully and will get back to you.

Actually, my main goal for loading both R1/R2 reads at the same time is doing this:

  1. trimming each read from a certain position
  2. counting unique sequences based on both reads

Basically, I'm going to expand this function to work with more complex inputs: https://github.com/ArcInstitute/ScreenPro2/blob/master/screenpro/ngs.py#L26-L50
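
As a rough sketch of those two steps in plain Python (the trim position and length here are made up for illustration, and this is not the ScreenPro2 implementation):

```python
from collections import Counter

# Toy stand-ins for parsed R1/R2 records: (read name, sequence) pairs
r1_reads = [("read1/1", "TTATTTGTCTCCAGC"), ("read2/1", "TTATTTGTCTCCAGC")]
r2_reads = [("read1/2", "GCTGGAGACAAATAA"), ("read2/2", "GCTGGAGACAAATAA")]

def trim(seq: str, start: int, length: int) -> str:
    # Keep `length` bases starting at 0-based `start`
    return seq[start:start + length]

# Step 1: pair reads on their shared name (stripping /1 and /2) and trim each mate
pairs = {}
for name, seq in r1_reads:
    pairs[name.replace("/1", "")] = [trim(seq, 0, 10), None]
for name, seq in r2_reads:
    pairs[name.replace("/2", "")][1] = trim(seq, 0, 10)

# Step 2: count unique (R1, R2) sequence combinations across all pairs
counts = Counter(tuple(p) for p in pairs.values())
print(counts)  # Counter({('TTATTTGTCT', 'GCTGGAGACA'): 2})
```

In SQL terms, the trim would be a SUBSTR over the sequence columns and the counting a GROUP BY over the trimmed pair.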

Do you have any suggestions? Thanks again for your insightful responses.

Thanks for the link 👍. I'm going to work through the trimming doc (#103) and how it relates to your link tonight/tomorrow morning, and I'll follow up then with more.

Awesome! Looking forward to that.

I'm also excited to hear your thoughts about this #102 (comment), but there is no rush; whenever you have time, please let me know what you think. I already released a new sub-version of my tool, which is now using biobear: ArcInstitute/ScreenPro2@85445c6. I'll add paired-end read support in the future (hopefully in the next few weeks).

I was able to follow your example here and it worked well. Thanks!!

ArcInstitute/ScreenPro2@fe9a27d

@abearab Nice, congrats on the release! And glad that worked well enough.

I am still working on the adapter/quality trimming info. I'm trying to add some functionality as I go to make it easier to use, e.g. SELECT quality_scores_to_list('!!!') returning [0, 0, 0]. (https://github.com/wheretrue/exon/pull/427/files)
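
For context, `!` is the lowest character in Phred+33 quality encoding, so that conversion can be sketched in plain Python like this (this is not the exon implementation, just the underlying arithmetic):

```python
def quality_scores_to_list(quality_string: str) -> list[int]:
    # Phred+33: each ASCII character encodes a score of ord(char) - 33
    return [ord(c) - 33 for c in quality_string]

print(quality_scores_to_list("!!!"))   # → [0, 0, 0]
print(quality_scores_to_list("##HH"))  # → [2, 2, 39, 39]
```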