ulikunitz/xz

expose blockreader

frrad opened this issue · 4 comments

frrad commented

I'm working on a library to do random access in xz files with multiple blocks. I would love to use your library to do the heavy lifting instead of reinventing the wheel. I need to use some of the internal pieces though, including the blockReader

xz/reader.go

Line 261 in 067145b

type blockReader struct {
and related parts.

Would you be open to a PR which refactors things so your original package keeps the same public interface but uses a new ulikunitz/xz/lib/xzinternals which makes public some of these currently private structs?

The block reader is intentionally left private, so that it is possible to change it without affecting users of the package. If the details would make public I would be forced to maintain the public interface. You are always free to branch the code, but I understand that you want to keep it upstream to benefit the improvements. Doing it will require to convince me that your approach can even work.

In a pure Lempel-Ziv compressed file random access doesn't make sense, because each part of compressed file depends on the window before it. So it makes only sense if parts of the file are compressed separately that are independent of each other. Now xz is a container format, which allows you to do something like this. Actually that could also be implemented using the LZMA2 format which is used inside the xz block.

But that means you will have to create xz files as a sequence of small blocks. So you would need not only to modify the reader but also the writer, which results in the requirement to also publish the block writer interface, which is not mentioned in your request.

How do you want to deal with the creation of the files?

frrad commented

Hi! I considered forking and if we can't find a way to do this that works for you I still might. As you say I would prefer to upstream my changes so I can benefit from future improvements that you make.

For my use case I don't need programmatic creation of the xz file. The plan is just to create it with xv --block-size ??. For reading, I plan to read the index and use that to only decompress the necessary blocks as described in this blog post https://blastedbio.blogspot.com/2013/04/random-access-to-blocked-xz-format-bxzf.html.

I understand your concerns about having to maintain the public interface to the internals. If you're open to this change we could work together to make the package boundary more streamlined and not just the sort of sloppy make-everything-public strategy I have in #29 right now.

Alternatively, if you're interested perhaps I could add the functionality I want directly to your package? This could potentially look like a new Seek method on your Reader type or maybe a new ReadSeeker type. One potential selling point here is that I suspect that the work required to understand the index for random access will make parallel decompression relatively straightforward to implement.

frrad commented

Sent a PR for the addition that I described above #31. I would love to work to add this to your lib instead of having to fork if you're open to it.

Appreciated. As I wrote it may take until the weekend to review it.