Overall optimization
mandel01 opened this issue · 7 comments
There are (at least!) two overall issues needed for optimization. First is more efficient file reading and writing. Second is a better overall plan for not going over the same data multiple times (likely involving numpy and/or pandas).
Will return to both of these once the full pipeline is available here and can be run with test data and timed.
Processing is pretty good now. Demultiplexing still takes a few hours on a laptop and could be improved.
Reframing this question.... We need to figure out whether we should optimize our current demultiplex code (e.g., wrapping C/C++) or use refined packages that are already optimized. I presume it is the latter. Will return to this issue down the road once the pipeline is more mature.
For demultiplex_fastq
function in demultiplex.py
, we can use groupby
function in order to group keys together. For a larger scale, we can also use PySpark to do groupby
function too. Below is my snippet:
from itertools import groupby, chain
def add_barcode(seq, barcodes_dict):
"Grab sequence of barcode and add to dictionary"
if seq.sequence[0:4] in barcodes_dict.values():
seq['barcode'] = seq.sequence[0:4]
else:
seq['barcode'] = '_other'
return seq
demultiplex_dict = {}
for sampleName, barcode in barcodes_dict.items():
demultiplex_dict[barcode] = []
seqfile = screed.open(fastq_file)
seqs = [s for s in seqfile] # read sequence using screed
seqfile_barcode = list(map(lambda seq: add_barcode(seq, barcodes_dict), seqs)) # get 4 characters barcode
# here we group dictionary by barcode that we created
for (key, group) in groupby(seqs, lambda x: x.barcode):
demultiplex_dict[key].extend(list(group))
I haven't tested on larger scale. For example snippet, this code uses about the same time as before. I hope this snippet will scale better with larger data. If itertools
is slow, we can also try pandas
library.
Can you send the entire function/set of functions? Or if you have it forked then I can look at the fork. I'm excited to give it a try but I'm having trouble just seeing how this snippet works. Thanks!!
Hi @titipata, this passes the simple tests so seems to function properly. On timing you're right that it's similar. I like how your design is more modular. I'm traveling over the next 1-2 weeks and will dig into it more but I'm sorry that my responses will likely continued to be a bit delayed over these next couple weeks.