ropensci/fastMatMR

ENH: Handle `.mtx.gz`

HaoZeke opened this issue · 4 comments

Future suggestion: could the package support reading .mtx.gz files, such as those I can download from, for example, https://math.nist.gov/MatrixMarket/data/NEP/quebec/qh1484.html ?

First noted here: ropensci/software-review#606 (comment).

This is a significant feature request, and for now the decompression can be done without much difficulty outside of R. It will be handled in a later release.

It's possible that this could be easy.

A big reason FMM uses iostreams is to enable use of existing libraries for specialized uses like this one. Here are two ideas.

Idea one: Use a GZip iostream wrapper. There are a bunch of lightweight-ish ones on GitHub; just note that some depend on zlib. If your users are largely using precompiled binaries then Boost has a good one: https://www.boost.org/doc/libs/1_83_0/libs/iostreams/doc/classes/gzip.html . This would be a 4-line solution, like the example at the bottom of that page.

Idea two: Do what the Python bindings do: provide an adapter between the stream types of Python and iostreams, then use Python's GZip decompressor. This adapter may or may not exist for R; I was lucky to find one for Python. One upside is that you can also use it to adapt all streams for that language (Python users often use StringIO/BytesIO objects), though I'm not sure if that's a common usage pattern in R. Another upside is the extra flexibility and not having to maintain gzip/bz2 or whatever dependencies. The downside is that the adapter is likely slower than native C++ file IO, so you'll want two code paths. Gzip decompression is slow anyway.
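
To make idea two concrete, here is a minimal Python sketch of the pattern: the host language's gzip stream does the decompression, and the reader only ever sees a plain text stream. The names `read_coordinate_mtx` and `open_mtx` are hypothetical, and the toy parser is only a stand-in for the real reader (it handles coordinate-format files with real values and nothing else). It is not how FMM or fastMatMR actually implement this.

```python
import gzip
import io


def read_coordinate_mtx(stream):
    """Toy reader for a coordinate-format Matrix Market text stream.

    Hypothetical stand-in for the real parser: it only needs an object
    with readline(), so it does not care where the text comes from.
    """
    header = stream.readline()
    if not header.startswith("%%MatrixMarket"):
        raise ValueError("not a Matrix Market stream")
    # Skip comment lines, then parse the size line: rows cols nnz.
    line = stream.readline()
    while line.startswith("%"):
        line = stream.readline()
    nrows, ncols, nnz = (int(tok) for tok in line.split())
    entries = []
    for _ in range(nnz):
        i, j, value = stream.readline().split()
        # Matrix Market indices are 1-based; store them 0-based.
        entries.append((int(i) - 1, int(j) - 1, float(value)))
    return (nrows, ncols), entries


def open_mtx(source):
    """Return a text stream, decompressing when the path ends in .gz."""
    if str(source).endswith(".gz"):
        return gzip.open(source, "rt", encoding="ascii")
    return open(source, "r", encoding="ascii")


if __name__ == "__main__":
    sample = (
        "%%MatrixMarket matrix coordinate real general\n"
        "% a comment line\n"
        "2 3 2\n"
        "1 1 1.5\n"
        "2 3 -4.0\n"
    )
    # Plain in-memory text stream.
    print(read_coordinate_mtx(io.StringIO(sample)))
    # The same text, gzip-compressed in memory and read through gzip.open.
    packed = io.BytesIO(gzip.compress(sample.encode("ascii")))
    with gzip.open(packed, "rt", encoding="ascii") as stream:
        print(read_coordinate_mtx(stream))
```

The reader never learns whether its input was compressed, which is the flexibility mentioned above. The trade-off is that every line passes through the host language's stream layer.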

Thanks a ton for looking into this.

I think idea two is feasible for R as well (due to its transparent support of Gzip files).

However, from a performance perspective idea one is way nicer, and if the dependencies are not too heavy (i.e. the resulting library with dependencies is small enough for CRAN and builds quickly enough, within 30 min) then that would probably be the best option.

I will investigate both :)

Curious what you decide on!

I bet the performance will be comparable, since it'll likely be zlib doing the work either way. Just who wraps it better :P
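
As a small illustration of the "zlib either way" point, at least on the Python side: the standard `gzip` module is a wrapper over `zlib`, and `zlib` can decode a gzip stream directly when given the right `wbits` value. This only shows the Python case; whether the R and C++ paths end up performing comparably is still a guess that would need measuring.

```python
import gzip
import zlib

payload = b"%%MatrixMarket matrix coordinate real general\n1 1 1\n1 1 2.0\n"
compressed = gzip.compress(payload)

# wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
via_zlib = zlib.decompress(compressed, wbits=16 + zlib.MAX_WBITS)
via_gzip = gzip.decompress(compressed)

# Both routes recover the original bytes.
print(via_zlib == via_gzip == payload)
```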