parallel parsing a lot of fasta files?
Hello needletail team,
I have more than 1 million FASTA files to parse, each about 3-5 MB, around 3 TB in total. I am wondering how I can read this huge number of files in parallel, using all the CPU cores I have. It seems there is no such tool available right now. Any suggestions?
Thanks,
Jianshu
Use https://github.com/rayon-rs/rayon to split the work; that's what we do.
Is that the finch crate?
Jianshu
Hello Keats,
Can you please give an example of how you would do it? My thinking is that I create an iterator over the file path of each FASTA file (dirwalk or something) and use into_par_iter() combined with the following parse command:
let mut reader = needletail::parse_fastx_file(&pathb).expect("expecting valid filename");
to allow parallel parsing.
Thanks,
Jianshu
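The "dirwalk or something" step above can be sketched with only the Rust standard library; this is a minimal illustration (the walkdir crate mentioned in this pattern would do the same with less code). The function names `is_fasta` and `collect_fasta_paths` are hypothetical, not from needletail:

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Hypothetical helper: treat .fa / .fasta extensions as FASTA files.
fn is_fasta(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|e| e.to_str()),
        Some("fa") | Some("fasta")
    )
}

// Recursively collect FASTA paths under `root` into `out`.
fn collect_fasta_paths(root: &Path, out: &mut Vec<PathBuf>) {
    if let Ok(entries) = fs::read_dir(root) {
        for entry in entries.flatten() {
            let path = entry.path();
            if path.is_dir() {
                collect_fasta_paths(&path, out);
            } else if is_fasta(&path) {
                out.push(path);
            }
        }
    }
}

fn main() {
    // Build a small demo tree in the temp directory.
    let root = std::env::temp_dir().join("fasta_walk_demo");
    fs::create_dir_all(root.join("sub")).unwrap();
    fs::write(root.join("a.fa"), ">s1\nACGT\n").unwrap();
    fs::write(root.join("sub").join("b.fasta"), ">s2\nTTTT\n").unwrap();
    fs::write(root.join("notes.txt"), "not a fasta\n").unwrap();

    let mut files = Vec::new();
    collect_fasta_paths(&root, &mut files);
    println!("{} fasta files found", files.len()); // prints "2 fasta files found"
}
```

The resulting `Vec<PathBuf>` is exactly the kind of collection rayon can then iterate over in parallel.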
My thinking is that I create an iterator over the file path of each FASTA file (dirwalk or something) and use into_par_iter() combined with the following parse command:
Yep exactly that.
It's pretty much the default example (https://github.com/onecodex/needletail/blob/master/src/lib.rs#L15-L39) except you would wrap something like this around it:
use rayon::prelude::*;

let results: Vec<_> = files.par_iter().map(|f| {
    // Here you can put the snippet from the example.
    // I'm returning a vec in that example; maybe you don't need to return
    // anything, in which case you can use for_each instead of map.
}).collect();
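To make the shape of that loop concrete without pulling in rayon or needletail, here is a self-contained sketch using only std: the closure body (here a trivial header count, a stand-in for the real parse_fastx_file work) runs once per file, and std::thread::scope plays the role rayon's worker pool would play. Note this spawns one thread per file purely for illustration; with a million files you would want rayon's par_iter(), which bounds the number of worker threads. All names here (count_headers, the demo paths) are made up for the sketch:

```rust
use std::fs;
use std::thread;

// Stand-in for the per-file parsing work: count '>' header lines.
fn count_headers(fasta_text: &str) -> usize {
    fasta_text.lines().filter(|l| l.starts_with('>')).count()
}

fn main() {
    // Create two tiny demo FASTA files in the temp directory.
    let dir = std::env::temp_dir();
    let paths: Vec<std::path::PathBuf> = (0..2)
        .map(|i| {
            let p = dir.join(format!("par_demo_{i}.fa"));
            fs::write(&p, format!(">seq{i}\nACGT\n>seq{i}b\nTTTT\n")).unwrap();
            p
        })
        .collect();

    // Fan the files out across threads and collect per-file results,
    // mirroring files.par_iter().map(...).collect::<Vec<_>>().
    let results: Vec<usize> = thread::scope(|s| {
        let handles: Vec<_> = paths
            .iter()
            .map(|p| {
                s.spawn(move || {
                    let text = fs::read_to_string(p).unwrap();
                    count_headers(&text)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    println!("{:?}", results); // prints "[2, 2]"
}
```

Swapping the closure body for the needletail snippet from the example, and thread::scope for par_iter(), gives the pattern described above.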
Thanks!
This is very helpful!
Jianshu