lmdu/pyfastx

Filter fastq file (performance)

Maarten-vd-Sande opened this issue · 1 comment

I have a paired-end single-cell RNA-seq dataset. R1 contains the reads, and R2 the barcodes needed to identify which cell each read belongs to. If I quality-trim only R1 to keep high-quality reads, R1 and R2 fall out of sync.

It seems like pyfastx can solve this problem for me by keeping only the reads in R2 that still have a mate in R1:

import gzip
import pyfastx

# placeholder file names for the actual dataset
reads = pyfastx.Fastq('R1.trimmed.fastq.gz')
barcodes = pyfastx.Fastq('R2.fastq.gz')
output = 'R2.filtered.fastq.gz'

with gzip.open(output, 'wt') as f:
    for read in reads:
        barcode = barcodes[read.id]  # random access by id
        f.write(barcode.raw)

However, from a rough benchmark, just getting barcode.raw takes around 0.0015 seconds per read, while the lookup itself is fast (~1e-6 s). At that rate I would have to wait about two days to filter my fastq. Is there an easier/better/faster way of doing this?

lmdu commented

For your case, barcodes[read.id] performs a random access to the read with the given id. This first extracts the read information from the index file (a SQLite database), which can be very slow when processing large numbers of reads.

I am a little confused about why you use read.id rather than read.name to extract reads. Using id implies that the reads in these two files are in the same order.
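
For illustration, a name-based version of the loop would only change the lookup key (file names are placeholders, not taken from the issue):

import gzip
import pyfastx

reads = pyfastx.Fastq('R1.trimmed.fastq.gz')
barcodes = pyfastx.Fastq('R2.fastq.gz')

with gzip.open('R2.filtered.fastq.gz', 'wt') as f:
    for read in reads:
        # look the barcode up by read name, which survives trimming
        f.write(barcodes[read.name].raw)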

If you extract reads from the other file by name, you can use multiple threads to speed this up.
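
A minimal sketch of that idea, assuming each worker opens its own Fastq handle, since it is unclear whether a shared index object is safe across threads (file names, chunk size, and worker count are arbitrary placeholders):

from concurrent.futures import ThreadPoolExecutor
import gzip
import pyfastx

def fetch_raw(names, barcode_path='R2.fastq.gz'):
    # each worker opens its own index to avoid sharing state
    barcodes = pyfastx.Fastq(barcode_path)
    return [barcodes[name].raw for name in names]

names = [read.name for read in pyfastx.Fastq('R1.trimmed.fastq.gz')]
chunks = [names[i:i + 10000] for i in range(0, len(names), 10000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    with gzip.open('R2.filtered.fastq.gz', 'wt') as f:
        # map preserves chunk order, so the output stays ordered
        for raws in pool.map(fetch_raw, chunks):
            f.writelines(raws)

How much this helps depends on whether pyfastx releases the GIL during lookups; if it does not, a ProcessPoolExecutor with the same top-level fetch_raw function is a possible alternative.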