[Feature Request] Generate all HiFi reads in a single file per sample

Question

Opened this issue 8 months ago · 2 comments

Hello again,

Thanks for your incredible help in using this software. While everything is working now, I think another significant increase in efficiency could come from generating one output file per input file (compared to the current system, which generates, to my knowledge, one output file per chromosome/DNA fragment). Perhaps this could be enabled as an option?

Currently I am (and I assume other users are as well) concatenating all of the output files into a single file with cat. An example follows:

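            sed -i '1,2d' "$INTERMEDIATE_DIR$OUTPUTNAME"_0*.sam # remove the two header lines from each per-fragment file
            cat "$INTERMEDIATE_DIR$OUTPUTNAME"_0*.sam > "$INTERMEDIATE_DIR$OUTPUTNAME.sam" # concatenate; the _0 glob keeps the merged file out of its own input
            rm "$INTERMEDIATE_DIR$OUTPUTNAME"_0* # delete the per-fragment files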
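
For anyone who wants to keep the headers instead of deleting them, here is a minimal sketch of the same merge done without editing the per-fragment files in place. It assumes only that header lines start with '@', as the SAM format specifies, and that the per-fragment files follow the `_0*.sam` naming used above. `merge_sam` is a hypothetical helper name, not part of PBSIM.

```shell
# merge_sam OUT IN1 [IN2 ...]
# Hypothetical helper: merge several SAM files into OUT, keeping one copy of
# each distinct header line followed by all alignment records.
# OUT must not be one of the inputs.
merge_sam() {
    out=$1
    shift
    # Headers: read only the leading '@' lines of each input (sed quits at the
    # first non-header line), then drop exact duplicates such as repeated @HD.
    for f in "$@"; do
        sed -n -e '/^@/!q' -e p "$f"
    done | awk '!seen[$0]++' > "$out"
    # Records: append every non-header line from all inputs in a single pass.
    awk '!/^@/' "$@" >> "$out"
}
```

Usage would look like `merge_sam "$INTERMEDIATE_DIR$OUTPUTNAME.sam" "$INTERMEDIATE_DIR$OUTPUTNAME"_0*.sam`. Because nothing is rewritten in place, each input is read once for its records and only its first few lines are read for the header, which may help when disk I/O is the bottleneck.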

Please let me know if you have any other questions about the desired functionality. And once again, thanks so much for an incredibly useful algorithm!

All the best,

Joe

Answer 1 · 2024-02-27T01:48:37.000Z

Thank you for your suggestion.
Since one output file per chromosome/DNA fragment is a small burden on the user, we will not add new functionality right away, but we plan to improve the usability of PBSIM in consideration of changes in demand for long read simulators in the future.

Answer 2 · 2024-03-01T11:30:45.000Z

Hey, thank you for your timely reply, and a huge thanks for considering this as a future feature. In the meantime, for highly fragmented/large genomes, do you have any suggestions for how to make the process more efficient? For instance, are there any ways to multithread sed (to remove file headers), or ways to speed up cat? I assume the primary limiting factor is disk read/write speed?

Any help is much appreciated.