Replace extension-based compression detection with the niffler crate
Opened this issue · 7 comments
Hello,
I'm one of the niffler crate developers and I think you might be interested in this crate.
Niffler lets you open gzip, bzip2, lzma (xz), or zstd compressed files transparently just by calling niffler::get_reader. Format detection is based on the magic number at the beginning of the file, not the extension (no need to trust the file name).
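To make the magic-number idea concrete, here is a minimal standard-library-only sketch. It is not niffler's implementation: the `Format` enum and the `sniff` and `sniff_reader` functions are hypothetical names chosen for illustration. It only shows how the first few bytes of a stream can be inspected and then replayed to the consumer.

```rust
use std::io::{self, Read};

/// Formats recognised by this sketch (hypothetical type, not niffler's own).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Uncompressed,
}

/// Classify a buffer by its leading magic bytes.
pub fn sniff(header: &[u8]) -> Format {
    const GZIP: &[u8] = &[0x1f, 0x8b];
    const BZIP2: &[u8] = b"BZh";
    const XZ: &[u8] = &[0xfd, b'7', b'z', b'X', b'Z', 0x00];
    const ZSTD: &[u8] = &[0x28, 0xb5, 0x2f, 0xfd];

    if header.starts_with(GZIP) {
        Format::Gzip
    } else if header.starts_with(BZIP2) {
        Format::Bzip2
    } else if header.starts_with(XZ) {
        Format::Xz
    } else if header.starts_with(ZSTD) {
        Format::Zstd
    } else {
        Format::Uncompressed
    }
}

/// Read up to 6 bytes, classify them, and return a reader that
/// yields those bytes again followed by the rest of the stream.
pub fn sniff_reader<R: Read>(mut reader: R) -> io::Result<(Format, impl Read)> {
    let mut buf = [0u8; 6];
    let mut filled = 0;
    while filled < buf.len() {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            break; // stream shorter than the longest magic number
        }
        filled += n;
    }
    let format = sniff(&buf[..filled]);
    let replay = io::Cursor::new(buf[..filled].to_vec()).chain(reader);
    Ok((format, replay))
}
```

The file name never enters the decision, so a gzip file named "test.vcf.zst" would still be classified as gzip.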
If you're interested, I can write a pull request.
I'd welcome a pull request, but before you do, I'd think getting a thumbs up from @tfenne makes sense
Yes please! I'm curious, on the writer side, how things work? Do you auto-pick compression based on the extension, or do you require users to specify?
And thank you!
We require users to specify (never trust a filename)
> never trust a filename
I mostly agree. However, you have to trust them at some point (e.g. when specifying the type of compression)? Perhaps if no compression type is given, we fall back on file extension detection? And if a compression type is given, we check the file extension against the few known ones so they don't mismatch, but continue if the file extension is unknown?
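The writer-side policy proposed above could look roughly like the following sketch. All names here (`Compression`, `from_extension`, `resolve`) are hypothetical and stand in for whatever the library would actually expose; the extension table is an assumption and deliberately small.

```rust
use std::path::Path;

/// Output formats considered by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Compression {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
    Uncompressed,
}

/// Map a known extension to a format; `None` means the extension is not recognised.
pub fn from_extension(path: &Path) -> Option<Compression> {
    match path.extension()?.to_str()? {
        "gz" => Some(Compression::Gzip),
        "bz2" => Some(Compression::Bzip2),
        "xz" => Some(Compression::Xz),
        "zst" => Some(Compression::Zstd),
        _ => None,
    }
}

/// Decide the output format:
/// - explicit choice plus a known, conflicting extension: error
/// - explicit choice otherwise: use it
/// - no explicit choice: fall back on the extension, else write uncompressed
pub fn resolve(path: &Path, requested: Option<Compression>) -> Result<Compression, String> {
    match (requested, from_extension(path)) {
        (Some(req), Some(ext)) if req != ext => Err(format!(
            "requested {:?} but {} looks like {:?}",
            req,
            path.display(),
            ext
        )),
        (Some(req), _) => Ok(req),
        (None, Some(ext)) => Ok(ext),
        (None, None) => Ok(Compression::Uncompressed),
    }
}
```

Whether a mismatch should be a hard error or only a warning is a design choice; the sketch returns an error so the caller can decide.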
This would also be a great time to solve how to specify the compression parameters for a wide variety of compression types (see: #9 (comment)). I see in niffler there are 22 levels, which are needed for zstd, but what is level 22 for zlib?
The choice I've made in several applications is that if the input is compressed in one format, the output is compressed in the same format, leaving the user free to choose the output format via a parameter. As for the compression level, I've chosen to keep the default compression levels (niffler doesn't detect the compression level used).
If users aren't satisfied with this behavior, they can write the uncompressed output to standard output and pipe it to their preferred compression tool with the parameters they have chosen.
After all, this is a library, not an application, so we don't necessarily need to make this choice right now.
As for compression levels: in niffler, if the requested level is too high for the format, we fall back to the maximum level for the chosen format.
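The fallback described above amounts to clamping. Here is a minimal sketch, assuming the commonly documented upper bounds of the underlying tools (9 for gzip, bzip2, and xz; 22 for zstd). These numbers are not read from niffler, and `Codec`, `max_level`, and `clamp_level` are hypothetical names.

```rust
/// Codecs covered by this sketch (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    Gzip,
    Bzip2,
    Xz,
    Zstd,
}

/// Highest level each codec is assumed to accept.
pub fn max_level(codec: Codec) -> u32 {
    match codec {
        Codec::Gzip | Codec::Bzip2 | Codec::Xz => 9,
        Codec::Zstd => 22,
    }
}

/// Return the requested level, capped at the codec's maximum.
pub fn clamp_level(codec: Codec, requested: u32) -> u32 {
    requested.min(max_level(codec))
}
```

Under these assumptions, asking for level 22 with gzip would simply yield level 9, while zstd would keep 22.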
Thanks @natir for the explanation and suggestions! I will keep #9 open for the time being, primarily for reference, but I'm happy to close it and replace it with your PR that uses niffler.
My personal opinion on the reader/writer side of things would be as follows:
- Reader: auto-detect input file compression via niffler, but also check whether the extension matches any of the "accepted" extensions for that file format, and at least give a warning otherwise (e.g. the filename is "test.vcf.gz" but the file is actually lzma compressed).
- Writer: the default behavior is to use the same compression format detected from the input, as you've mentioned. Once again, an extension check might be useful so that we don't detect gzip in a file with a ".zst" extension and then try to write a ".zst.gz" file. Unlikely, but just in case. As for levels, the current mechanism used by niffler seems reasonable to me.
Finally, an interesting(?) case I could think of is something like #8, where a VCF.gz could be read as a gzipped file but most downstream tools expect it to be written as BGZF. I might have missed it, but I couldn't see a BGZF module/support in niffler. Would it make sense to add it (assuming it actually isn't there and I didn't miss it) and have a rule like "if the file format is VCF, even if the original compression is gzip, the writer will default to BGZF", or would that be too much?
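For reference, BGZF can be told apart from plain gzip on the reader side, because each BGZF block is a gzip member whose extra field carries a "BC" subfield of length 2. A minimal standard-library sketch of that check follows; `looks_like_bgzf` is a hypothetical helper, not an existing niffler function.

```rust
/// Return true if `header` looks like the start of a BGZF block:
/// gzip magic, deflate method, FEXTRA flag set, and an extra
/// subfield with identifier "BC" and payload length 2.
pub fn looks_like_bgzf(header: &[u8]) -> bool {
    if header.len() < 12
        || header[0] != 0x1f
        || header[1] != 0x8b
        || header[2] != 0x08
        || (header[3] & 0x04) == 0
    {
        return false;
    }
    // XLEN: total length of the extra field, little-endian.
    let xlen = u16::from_le_bytes([header[10], header[11]]) as usize;
    let mut rest = match header.get(12..12 + xlen) {
        Some(extra) => extra,
        None => return false, // buffer too short to hold the extra field
    };
    // Walk the subfields: 2 identifier bytes, 2 length bytes, then the payload.
    while rest.len() >= 4 {
        let slen = u16::from_le_bytes([rest[2], rest[3]]) as usize;
        if rest[0] == b'B' && rest[1] == b'C' && slen == 2 {
            return true;
        }
        rest = match rest.get(4 + slen..) {
            Some(next) => next,
            None => return false, // malformed subfield length
        };
    }
    false
}
```

A writer rule such as "VCF input detected as gzip is written back as BGZF" could then be layered on top of this check, but that policy is a separate decision.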