single-thread mode
notestaff opened this issue · 4 comments
Is it possible for seqan3-based programs to use only one CPU? I tried setting seqan3::contrib::bgzf_thread_count to 1, but the BAM-reading program still uses 200% CPU according to GNU time: one main thread and one for seqan3's decompression. Looking at the code, setting seqan3::contrib::bgzf_thread_count to 0 would not be supported, correct?
I'm trying to make a CLI like that of samtools: using one CPU by default, with an option to specify additional CPUs. Is there a way to do that? Thanks!
@eseiler
The bgzf handling is built around using a threadpool, so it always spawns at least one thread.
It should be possible to use the stream-based constructor for the input.
If there should only be one thread, the gz stream could be used instead, which should also work for bgzf-compressed files.
So:
- check if file is BAM/bgzf compressed (easy, but not reliable: file extension. Harder, but reliable: magic number)
- if not: just use sam_file_input with the filename
- otherwise:
  - construct an fstream from the file
  - construct a gz stream from the fstream
  - use the sam_file_input ctor with the gz stream and format_bam as format
Haven't tried it yet, but this would be my hacky workaround.
As for our code:
It should be possible to just use gz (for input) if there is one thread requested. Not sure about performance implications for decompressing (probably none?). We can't really do it for output, because we would then write a gz file instead of bgzf.
This seems to me like a recurring issue, and I am wondering whether the mechanism to switch to gz decompression instead of bgzf decompression should be more straightforward to handle in the API.
> This seems to me like a recurring issue, and I am wondering whether the mechanism to switch to gz decompression instead of bgzf decompression should be more straightforward to handle in the API.
I agree.
Another thing we had is that we used to write bgzf files when gz output was requested. bgzf is faster because it can be parallelised. However, bgzf is not the same as gz, though it's compatible. The binary representation is different and the file size differs (I think I had a case where a bgzf-compressed FASTA file was 20% bigger than its gz-compressed counterpart).
True. Following this, I can make out four possible decisions that the user could make:
On output
- Use bgzf for output compression
  - default by spec
  - random access support
  - serial (no separate compression thread) or parallel (at least two threads: 1 main, >= 1 compression worker)
    - Which mode is default? If parallel, how many threads are default?
- Allow user to explicitly switch to gz-compression
  - no random access support
  - always single-threaded
On input
- Use bgzf-decompression if bgzf-compressed
  - default by spec
  - always parallel
  - serial (no separate decompression thread) or parallel (at least two threads: 1 main, >= 1 decompression worker)
- Allow user to explicitly use gz-decompression
  - always serial
  - independent of bgzf or gz-compression
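The input half of this list can be written down as a small decision function for a samtools-like CLI that defaults to one thread. This is only a sketch of the policy discussed here, not an existing seqan3 API: `input_kind`, `reader_kind`, `read_plan` and `plan_input` are hypothetical names, and the rule "one thread means serial gz decompression even for bgzf input" is the assumption taken from the workaround above.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical description of what the input file looks like.
enum class input_kind { uncompressed, gzip, bgzf };

// Hypothetical description of how the program would read it.
enum class reader_kind { plain, serial_gz, parallel_bgzf };

struct read_plan
{
    reader_kind reader;
    std::size_t worker_threads; // threads in addition to the main thread
};

// Map "kind of input" and "total threads requested by the user" to a read plan.
inline read_plan plan_input(input_kind kind, std::size_t total_threads)
{
    if (total_threads == 0)
        throw std::invalid_argument{"at least one thread is required"};

    switch (kind)
    {
    case input_kind::uncompressed:
        return {reader_kind::plain, 0};
    case input_kind::gzip:
        // gz cannot be decompressed in parallel, so it is always serial.
        return {reader_kind::serial_gz, 0};
    case input_kind::bgzf:
        // One thread requested: read the bgzf file through the serial gz path.
        if (total_threads == 1)
            return {reader_kind::serial_gz, 0};
        // Otherwise keep one main thread and give the rest to decompression.
        return {reader_kind::parallel_bgzf, total_threads - 1};
    }
    throw std::logic_error{"unhandled input_kind"};
}
```

In a program like the one described in the question, `worker_threads` would be the value to hand to seqan3::contrib::bgzf_thread_count when the plan is `parallel_bgzf`; the output side is left out because, as noted above, a serial gz writer would produce a gz file rather than bgzf.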