competition
Hi,
If you are interested in more competition, you might want to add my pfasta parser to your benchmarks. It doesn't do FASTQ, but prints nice error messages.
Best,
Fabian
Hi Fabian,
Given this is a bottleneck in many bioinformatics workflows, I'd certainly love to see more rigorous benchmarks on FAST(A/Q) parsing (like the benchmarks game or TechEmpower's server benchmarks), here or elsewhere. :)
This is a little out of scope for our aims with this repo right now (our primary goal was preventing memory/threading issues by using Rust, and speed is a happy benefit!), but I'll leave this issue open as a reminder for the future. Do you have any thoughts on what an expanded suite of benchmarks should test? We were just counting bases, but there may be utility in slightly more complicated tests, like GC calculation or counting the reverse-complemented k-mers equal to some randomly chosen value (e.g. "ACGT").
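For concreteness, here is a rough Python sketch of what those three per-record workloads might look like. It is only an illustration: the function names are hypothetical and not taken from this repo or from pfasta, and a real benchmark would run them on the records produced by each parser under test.

```python
# Hypothetical benchmark workloads; names are illustrative, not from any library.

# Translation table mapping each base to its complement (case preserved).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def count_bases(seq: str) -> int:
    """Simplest workload: the number of bases in one record."""
    return len(seq)


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C; defined as 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    gc = sum(1 for base in seq if base in "GCgc")
    return gc / len(seq)


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_revcomp_kmers(seq: str, target: str) -> int:
    """Count overlapping k-mers whose reverse complement equals `target`."""
    k = len(target)
    if k == 0:
        return 0
    return sum(
        1
        for i in range(len(seq) - k + 1)
        if reverse_complement(seq[i:i + k]) == target
    )
```

Each workload touches every base, so the parser cannot skip over sequence data, and the k-mer variant adds a fixed amount of extra work per position.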
Thanks for the idea!
Roderick
> Given this is a bottleneck in many bioinformatics workflows, I'd certainly love to see more rigorous benchmarks on FAST(A/Q) parsing
In my experience, I/O is much faster than processing the data, so I focused on correctness rather than speed. But your mileage may vary.
> Do you have any thoughts on what an expanded suite of benchmarks should test?
Check resilience to errors in the input. My pfasta repo contains a bunch of tests for edge cases and even provides nice error messages telling the user how and why parsing went wrong. I even had it fuzzed to ensure it doesn't crash.
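To illustrate the kind of check meant here, below is a minimal Python sketch of a strict FASTA parser that reports the line number and reason for each failure. It is not pfasta's implementation or its error format. The names `FastaError` and `parse_fasta` are hypothetical, and so is the rule that sequence lines may only contain ASCII letters, `-`, and `*`.

```python
from typing import List, Optional, Tuple


class FastaError(ValueError):
    """Parse error that records the 1-based line number where it occurred."""

    def __init__(self, line_no: int, message: str):
        super().__init__(f"line {line_no}: {message}")
        self.line_no = line_no


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (name, sequence) pairs, failing loudly on bad input."""
    records: List[Tuple[str, str]] = []
    name: Optional[str] = None
    chunks: List[str] = []
    header_line = 0

    def finish() -> None:
        # Close the current record; a header without sequence is an error here.
        if not chunks:
            raise FastaError(header_line, f"record {name!r} has no sequence")
        records.append((name, "".join(chunks)))

    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated in this sketch
        if line.startswith(">"):
            if name is not None:
                finish()
            name = line[1:].strip()
            if not name:
                raise FastaError(line_no, "header has no sequence name")
            chunks, header_line = [], line_no
        elif name is None:
            raise FastaError(line_no, "sequence data before the first '>' header")
        else:
            bad = next(
                (c for c in line if not ((c.isascii() and c.isalpha()) or c in "-*")),
                None,
            )
            if bad is not None:
                raise FastaError(line_no, f"unexpected character {bad!r} in sequence")
            chunks.append(line)

    if name is None:
        raise FastaError(1, "input contains no FASTA records")
    finish()
    return records
```

A test suite built on this idea would feed in one malformed file per edge case and assert on the reported line, and a fuzzer would check that no input triggers anything other than `FastaError`.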
@kloetzl Thanks for that link. Those edge cases certainly look like they're worth adding to our tests here!
Closing this issue in favor of #34. Note that we're now checking correctness against the https://github.com/BioJulia/FormatSpecimens.jl repo, which seems to have a good collection of unambiguous test cases (pfasta is strictly more correct in that it rejects e.g. empty records which we need to support because of downstream applications).
> (pfasta is strictly more correct in that it rejects e.g. empty records which we need to support because of downstream applications)
That was a deliberate choice. Most Unix programs (cat, head, sort, etc.) don't care if the input is empty. However, I think it is annoying when you run a long analysis only to realize later that one of the input files was corrupt.
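The two positions in this exchange can be expressed as a single policy switch. The Python sketch below is hypothetical: `validate_records` and its `allow_empty` flag are made up for illustration and are not part of either parser. It assumes records arrive as (name, sequence) pairs.

```python
from typing import List, Tuple


def validate_records(
    records: List[Tuple[str, str]], allow_empty: bool = False
) -> List[Tuple[str, str]]:
    """Return `records` unchanged, or raise if empty data is found in strict mode.

    allow_empty=False fails fast on empty input or empty sequences, so a
    corrupt file is noticed before a long analysis starts.
    allow_empty=True passes such data through for downstream tools that need it.
    """
    if allow_empty:
        return records
    if not records:
        raise ValueError("input contains no records")
    for name, seq in records:
        if not seq:
            raise ValueError(f"record {name!r} has an empty sequence")
    return records
```

With the strict default, `validate_records([("a", "")])` raises immediately, while `validate_records([("a", "")], allow_empty=True)` returns the record untouched.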