elsasserlab/minute

Make demultiplexing faster

cnluzon opened this issue · 9 comments

I have seen that Cutadapt can run multithreaded. Right now it runs on a single core, and it is a bit of a bottleneck step for large files.

I also suspect this bottleneck is caused by a very slow file system on the cluster, but maybe some multithreading can help nonetheless.

Cutadapt can run multithreaded, but not when demultiplexing. There are other ways to speed it up, though. I’ll look into this.

One very easy improvement is to use a lower gzip compression level for the output files. Compressing the output takes roughly half of the runtime at level 6, which is currently the default. We can use option -Z to set the compression to level 1. Since the output files from demultiplexing are only intermediate files anyway, it does not matter if they get a bit bigger.

A second improvement is to use the --no-indels option. With this, Cutadapt can build an index of all adapter sequences and is much faster when many adapters (or in this case: barcodes) need to be found at the same time. This does change results slightly because barcodes with an insertion or a deletion will then no longer be found. I will run a test to see how the numbers change.
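To make this concrete, here is a minimal sketch of what a demultiplexing call with both options could look like. The barcode FASTA, the output name template and the input file names are placeholders rather than the pipeline's actual command, and it assumes the barcodes are anchored 5' sequences (^-prefixed in the FASTA), which is what lets Cutadapt build the index:

```bash
# Demultiplex paired-end reads by their 5' barcode.
#   -Z           write gzipped output at compression level 1
#                (--compression-level=4 is a size/speed compromise)
#   --no-indels  lets Cutadapt index all barcodes at once, at the cost of
#                missing barcodes that contain an insertion or deletion
cutadapt \
    -g file:barcodes.fasta \
    -Z \
    --no-indels \
    -o "demux/{name}_R1.fastq.gz" \
    -p "demux/{name}_R2.fastq.gz" \
    --untrimmed-output demux/unknown_R1.fastq.gz \
    --untrimmed-paired-output demux/unknown_R2.fastq.gz \
    pooled_R1.fastq.gz pooled_R2.fastq.gz
```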

Some measurements (when demultiplexing into six output files):

| Options | Runtime per read (µs) |
|---|---|
| Current | 60 |
| With --compression-level=4 | 40 |
| With -Z | 36 |
| With --compression-level=4 and --no-indels | 32 |
| With -Z and --no-indels | 28 |

Edit: Compression level 4 added.

I also tested whether it might help not to write the files in which no barcode has been found (that is, omitting --untrimmed-(paired-)output), but it doesn’t make a difference.

That's quite a time reduction; it should be good enough. I really think most of the problem comes from the cluster storage being slow: /proj/ directories have very slow access times these days. I am also in the process of clearing a lot of clutter from our quota, since having it nearly full probably does not help either.

It could also help (though I am not sure whether it is possible) to put the demultiplexed files on the scratch directory of the node that is executing the job.

Note also that in many cases the demultiplexed files need to be kept, because the GEO database does not accept raw pooled samples when we upload.

But I think it should not be a problem to have larger files if it significantly improves the processing speed. We could re-compress them afterwards if really necessary, and if that were part of the pipeline it would not be a bottleneck anymore.
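If smaller files are ever needed for archiving or a GEO submission, re-compressing at a higher level afterwards is a simple step. A sketch with placeholder file names (pigz is only an option if it is installed on the cluster):

```bash
# Re-compress a level-1/level-4 gzipped FASTQ at maximum compression.
# The content is unchanged; this only trades extra CPU time for smaller files.
zcat demux/sample1_R1.fastq.gz | gzip -9 > upload/sample1_R1.fastq.gz

# Parallel alternative, if pigz is available:
zcat demux/sample1_R1.fastq.gz | pigz -9 > upload/sample1_R1.fastq.gz
```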

I was going to open a new issue regarding moving stuff over to /scratch, will do so now.

If the demultiplexed files are needed, we should not put them into the tmp/ subdirectory. Compression level 1 really does result in files that are quite a bit bigger, but we could go to level 4, which is also quite an improvement (I’ve added it to the table above).

We can go with level 4 and see how much the time is reduced. I still need to run a fairly big dataset that was taking many hours to demultiplex, which is very strange.

I also agree about moving the demultiplexed files out of tmp/.

I have now tested --no-indels on the first 3 M reads of the test dataset I’ve been using.

With indels allowed, barcodes were found in 54.8% of the input reads. With --no-indels, that was reduced to 53.8%. So some reads are lost, but at an acceptable level in my opinion (and also considering that the previous version of the pipeline did not consider indels either). Because there is some loss, however, I suggest we go ahead with the other improvements and then revisit --no-indels if that isn’t enough.

This change alleviates the timing problem quite a bit.

For the largest file pair I have at hand (two 18 GB FASTQ files), demultiplexing would exceed the 15-hour time limit on the cluster and time out. After changing the compression level, it still takes somewhat more than 8 hours, which is not the case for most of the files we have. I tested running it locally on my laptop, and there the time goes down to four and a half hours. It also does not seem predictable how much time it is going to take.

I think it's very likely that moving the demultiplexing to $TMPDIR and then copying the files once everything is done could help. The constant I/O on a set of open files in /proj/ directories may just be problematic at this point, even though it shouldn't be.
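As a rough sketch of that idea (hedged: the paths and the exact $TMPDIR handling depend on the cluster setup, and this is not what the pipeline currently does):

```bash
# Write the demultiplexed files to node-local scratch, then copy the finished
# files back to project storage in one step instead of doing constant small
# writes over the network file system.
workdir="${TMPDIR:-/tmp}/demux_$$"
mkdir -p "$workdir"

cutadapt \
    -g file:barcodes.fasta \
    -Z \
    -o "$workdir/{name}_R1.fastq.gz" \
    -p "$workdir/{name}_R2.fastq.gz" \
    pooled_R1.fastq.gz pooled_R2.fastq.gz

cp "$workdir"/*.fastq.gz /proj/ourproject/demux/
rm -rf "$workdir"
```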

I have spent a little bit of time on fast demultiplexing in Cutadapt even when indels are allowed in the barcode sequence, and except for some minor details I need to fix, this is working now (see marcelm/cutadapt#486). The more barcode sequences are given, the bigger the speedup, so when many libraries are multiplexed, either this speedup or using --no-indels becomes even more important. Anyway, this is something that doesn’t need to be tracked here, we’ll just need to update to a newer Cutadapt version when it comes out.

Also the other things that we have discussed here have been implemented (#72, #77) or are being tracked in other issues (#71), so I’d like to close this issue now. I want to continue the $TMPDIR discussion, but let’s do that in #71. (Feel free to re-open if you disagree.)