mjmlab/pyinseq

Overall optimization

mandel01 opened this issue · 7 comments

There are (at least!) two overall issues needed for optimization. First is more efficient file reading and writing. Second is a better overall plan for not going over the same data multiple times (likely involving numpy and/or pandas).

Will return to both of these once the full pipeline is available here and can be run with test data and timed.

Processing is pretty good now. Demultiplexing still takes a few hours on a laptop and could be improved.

Reframing this question.... We need to figure out whether we should optimize our current demultiplex code (e.g., wrapping C/C++) or use refined packages that are already optimized. I presume it is the latter. Will return to this issue down the road once the pipeline is more mature.

For demultiplex_fastq function in demultiplex.py, we can use groupby function in order to group keys together. For a larger scale, we can also use PySpark to do groupby function too. Below is my snippet:

from itertools import groupby, chain

def add_barcode(seq, barcodes_dict):
    "Grab sequence of barcode and add to dictionary"
    if seq.sequence[0:4] in barcodes_dict.values():
        seq['barcode'] = seq.sequence[0:4]
    else:
        seq['barcode'] = '_other'
    return seq

demultiplex_dict = {}
for sampleName, barcode in barcodes_dict.items():
    demultiplex_dict[barcode] = []

seqfile = screed.open(fastq_file)
seqs = [s for s in seqfile] # read sequence using screed
seqfile_barcode = list(map(lambda seq: add_barcode(seq, barcodes_dict), seqs)) # get 4 characters barcode

# here we group dictionary by barcode that we created
for (key, group) in groupby(seqs, lambda x: x.barcode):
    demultiplex_dict[key].extend(list(group))

I haven't tested on larger scale. For example snippet, this code uses about the same time as before. I hope this snippet will scale better with larger data. If itertools is slow, we can also try pandas library.

Can you send the entire function/set of functions? Or if you have it forked then I can look at the fork. I'm excited to give it a try but I'm having trouble just seeing how this snippet works. Thanks!!

Sure @mandel01, I'm working on it so it works for example example02_AAAA.fastq dataset here.

@mandel01, I make changes in this fork. (I just fixed typos in the function)

Hi @titipata, this passes the simple tests so seems to function properly. On timing you're right that it's similar. I like how your design is more modular. I'm traveling over the next 1-2 weeks and will dig into it more but I'm sorry that my responses will likely continued to be a bit delayed over these next couple weeks.