parallel reading 10G fasta/fastq file
Dear seq_io team,
I have many 10G FASTA/FASTQ files (10,000), and I want to read each file in parallel (the order of the records within a file does not matter) so that I can speed up reading all of them. What is the best approach you would recommend?
Thanks,
Jianshu
Hi! There is an implementation of parallel processing; the documentation can be found here. Although the docs still describe it as "experiments with parallel processing", I think the code works well. It is used in seqtool, and so far I haven't found any bugs.
Parallel processing as implemented in the parallel module is intended for use cases involving some time-consuming analysis of the sequences, which is done in worker threads; the results are then passed on to the main thread along with the corresponding sequence records (but not necessarily in the same order as in the file).
Does this apply to your use case?
Generally I hope to make the API nicer and more usable in the next version.
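In case it helps, the sketch below shows roughly how this could look with the parallel module; the file name and the GC-count analysis are only placeholders, the thread/queue numbers are arbitrary, and the exact closure signatures should be checked against the parallel docs.

```rust
use seq_io::fasta::{Reader, Record};
use seq_io::parallel::parallel_fasta;

fn main() {
    // Placeholder file name.
    let reader = Reader::from_path("input.fasta").expect("cannot open file");

    // 4 worker threads and a queue length of 2 record sets (numbers just for illustration).
    parallel_fasta(reader, 4, 2,
        |record, gc| {
            // Runs in a worker thread: put the expensive per-record analysis here
            // (placeholder: GC count).
            *gc = record.seq().iter().filter(|&&b| b == b'G' || b == b'C').count();
        },
        |record, gc| {
            // Runs in the main thread with the record and the worker's result.
            println!("{}\t{}", record.id().unwrap(), *gc);
            // Returning Some(value) would stop the reader early and return the value.
            None::<()>
        })
        .expect("parsing error");
}
```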
Hello! Yes, I am doing some very expensive tasks on the records of each FASTA file (those expensive tasks have already been parallelized). I was using needletail for FASTA parsing, but for large files it only supports sequential parsing, which is very slow for 10G files. Memory is not a problem for me; I just want to parse those FASTA files as fast as possible, because parsing is now the limiting step. I will send the parsed records from the main thread to other threads for the expensive tasks, but I need to obtain the records very quickly, using all the threads I have on the machine (128 threads at most).
Thanks,
Jianshu
One advantage I see for needletail is that it supports .gz and other compressed formats, which is significant because a 10G FASTA file is only about 2G after compression.
How large are single records in your case? If you have many (possibly thousands of) sequence records per 10G FASTA file, the functions in seq_io::parallel should work well at distributing these records across 128 threads for processing, even with sequential parsing. The size of the input file does not matter; you can even process a stream from STDIN (although that may be slower than reading directly from a file). Whether sequential or parallel parsing performs better may also depend on how fast the file I/O is in both scenarios (is it a performance bottleneck?). Maybe even a combination of both could make sense, e.g. grouping the files into batches that are read in parallel by threads (or processes), each of which sequentially reads the files and processes them in a few worker threads; see the sketch below. Of course, depending on your data flow, you may then have to merge the output somehow.
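To illustrate the per-file batching idea (this is outside of seq_io itself): the sketch below assumes the rayon crate for the file-level parallelism, each file being parsed sequentially; the file names and the GC count are placeholders.

```rust
use rayon::prelude::*;
use seq_io::fasta::{Reader, Record};

fn main() {
    // Placeholder list of input files; in practice this would be collected from disk.
    let files: Vec<String> = (0..4).map(|i| format!("sample_{}.fasta", i)).collect();

    // rayon's default thread pool uses one thread per logical CPU (128 in your case),
    // so many files are parsed at the same time, each one sequentially.
    files.par_iter().for_each(|path| {
        let mut reader = Reader::from_path(path).expect("cannot open file");
        while let Some(result) = reader.next() {
            let record = result.expect("invalid FASTA record");
            // The expensive per-record analysis would go here (placeholder: GC count).
            let _gc = record.seq().iter().filter(|&&b| b == b'G' || b == b'C').count();
        }
    });
}
```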
Regarding needletail, unfortunately I don't know that library well, so I can't say how seq_io and needletail compare in performance, or which one is better suited for you. Generally, seq_io has a strong focus on the basic task of reading and writing and does not provide advanced features as needletail does. However, reading from compressed files can still be implemented relatively easily. See here for an example of dynamically creating a Box<std::io::Read> depending on the input (uncompressed, or compressed with different formats), which can then be used as the input of a seq_io reader. Non-buffered readers should be fastest, since seq_io manages its own buffer.
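A sketch of what that could look like, assuming the flate2 crate for gzip; the extension check and the file name are only illustrative.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

use flate2::read::MultiGzDecoder;
use seq_io::fasta::{Reader, Record};

/// Open a FASTA file that may or may not be gzip-compressed, based on its extension.
fn open_fasta(path: &Path) -> std::io::Result<Reader<Box<dyn Read>>> {
    let file = File::open(path)?;
    let rdr: Box<dyn Read> = if path.extension().map_or(false, |e| e == "gz") {
        Box::new(MultiGzDecoder::new(file))
    } else {
        Box::new(file)
    };
    // No BufReader in between: seq_io manages its own buffer.
    Ok(Reader::new(rdr))
}

fn main() -> std::io::Result<()> {
    // Placeholder file name.
    let mut reader = open_fasta(Path::new("sample.fasta.gz"))?;
    while let Some(result) = reader.next() {
        let record = result.expect("invalid FASTA record");
        println!("{}", record.id().unwrap());
    }
    Ok(())
}
```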
Yes, about 10^7 records of 150bp to 250bp per file (a so-called metagenomic sequencing dataset), so it can be distributed to 128 threads, I imagine.