nicolas-comerci/precomp-cpp

Question about data integrity specific to precomp

Closed this issue · 4 comments

Okay, before this goes into depth about precomp: know that what I'm about to explain happened a long time ago, and I do not have this data anymore.

Before I start, I just wanted to say a HUGE thanks for adding pipes, that's one feature I was really hoping would be added to precomp.

Okay so, a long time ago, with the original project (so not this fork), I had a really large file, I'd say 200GB or more, idk, and it had been processed with precomp + lzma. This was also at a time when I did not have the original copy (though I had a git folder), but anyway, I lost all of that data. I may have had at least an area in the source code where precomp had issues decoding, but I don't anymore; I basically scrapped everything I had with precomp at that point.

The issue had to do with precomp's brute feature. I was able to recompress, meaning I was able to decompress the data, but could not extract the original data precomp had encoded. All I know is that somehow precomp's brute feature managed to ruin the data (I'm really guessing here, I have no idea), and I lost all of the data precomp had encoded (I only managed to get about 7GB out of the tar).

Does precomp use any checksum, like other programs do, to verify data integrity? Precomp has a great feature when you really want even more compression out of already-compressed data, that is, data that can actually be compressed further if reformatted a certain way. I know bzip2 and lzip have some sort of data recovery built in, though I don't know how to use it; in any case I don't use BWT or LZMA, I've been using ZPAQ for everything now.

I'd like to get back into using precomp if I can really make use of the recompression feature; however, precomp is the only program I've used during compression whose output I was unable to fix after something got corrupted (I'm no expert in this stuff, I just tinker with compression benchmarking).

I'll be happy to try to get the brute feature to mess up, but I'm honestly not sure what happened, other than that whatever got corrupted had to do with the brute feature trying to decode (this was also at a time when I didn't know -intense could be added alongside -brute).

TL;DR

  • Yeah, it sucks
  • Checksums by themselves won't really solve the problem; at best they will just help you notice that a problem happened
  • My plan to address this, and hopefully make Precomp much more reliable, is to add a verification step to the stream precompression process, and probably even make it the default

My reasoning

Ultimately it all comes down to Precomp, neither the original nor this fork, never having reached a point where it could come out of beta and be considered "production ready".
It is one of my major concerns now that pretty much all of the structural changes I envisioned for Precomp Neo are in place.
I do of course agree: what is the point of ending up with a smaller file from Precomp if my data could be gone?

Fortunately, Precomp actually works pretty well most of the time, but it does currently fail to recompress some particular streams (I've seen it fail silently, so not crash or hang, just reconstruct wrong data, with Zlib or Deflate streams as you describe, but also with PNG and other formats).

I do recommend ensuring your data is recoverable with Precomp before you actually store it and delete the original.
That is, recompress it and make sure it doesn't crash or hang, AND do a checksum check against the original.
Heck, you might even want to store the Precomp executable you used alongside your data, so that if anything breaks or becomes incompatible in a future version (I'll try to keep that from happening, but you know, stuff happens) you still have the exact version you know can recover your files.
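That manual check, hash the original, restore from the .pcf, then compare, is easy to script. A minimal Python sketch of just the comparison half (the file names are placeholders, and the restore step itself is whatever Precomp invocation you normally use):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 in chunks so huge files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with two throwaway files; in practice these would be the original
# and the copy restored from the .pcf, checked BEFORE deleting anything.
with open("original.bin", "wb") as f:
    f.write(b"important data" * 1000)
with open("restored.bin", "wb") as f:
    f.write(b"important data" * 1000)

ok = sha256_of("original.bin") == sha256_of("restored.bin")
print("round trip OK" if ok else "MISMATCH: do not delete the original!")
```

The chunked read matters for the multi-hundred-GB files mentioned above; hashing in one `f.read()` would try to hold the whole file in RAM.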

As far as checksums go: no, Precomp doesn't store checksums for the precompressed streams, though I don't think that's too important, and I will explain why.
You will usually use Precomp in conjunction with something else: 7-Zip/Xz/Lzip, piping it through SREP, etc.
Most likely, whatever you use there, like in the examples I just gave, already has checksums, so as far as detecting bit flips or that kind of data corruption that might happen later in storage or while transferring the file, you should be covered.

Unless we actually add a verification step that checks the precompressed data can be recompressed into data with a matching checksum, storing the checksum seems mostly pointless: at most it would allow Precomp to clearly state during recompression that an error happened, but it wouldn't let us catch the problem during precompression.

So, if you ask about my plans for the near future (as much as my time allows for continued work on this project): continue mooching off whatever else you are using alongside Precomp for checksums, at least for now, and add a feature for immediately verifying streams after precompressing them.
I picture a --verify option that does precisely that: if Precomp runs into any sort of error in the process, or the data does not pass a checksum check, the stream is rejected and stored as uncompressed data.
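That verify-then-fallback logic can be sketched in a few lines of Python, with zlib standing in for one of Precomp's stream codecs (the real Precomp is C++ and records the exact reconstruction parameters; here we simply assume compression level 9, and all names are illustrative):

```python
import hashlib
import zlib

def precompress_with_verify(stream: bytes) -> tuple[str, bytes]:
    """Try to 'precompress' (inflate) a deflate stream, but verify the
    round trip before committing; otherwise store the stream untouched.

    Returns (tag, payload): tag "P" means verified precompressed data,
    tag "R" means raw fallback. zlib stands in for Precomp's codecs.
    """
    original_digest = hashlib.sha256(stream).digest()
    try:
        inflated = zlib.decompress(stream)       # the precompression step
        # Verification step: recompress immediately and checksum-compare.
        recompressed = zlib.compress(inflated, 9)
        if hashlib.sha256(recompressed).digest() == original_digest:
            return ("P", inflated)
    except zlib.error:
        pass
    # Any error or checksum mismatch: reject, keep the stream as raw data.
    return ("R", stream)

# A level-9 deflate stream round-trips bit-exactly and is accepted,
# while non-deflate bytes fall back to raw storage.
good = zlib.compress(b"hello " * 100, 9)
print(precompress_with_verify(good)[0])
print(precompress_with_verify(b"\x00not a deflate stream")[0])
```

The key design point is the fallback: a stream that fails verification is not an error, it is simply stored uncompressed, so the output always reconstructs correctly even when a codec can't reproduce the original bits.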

Thanks for the detailed response! I appreciate the level of detail in your explanation. I definitely agree with everything you've mentioned.

Your work on precomp has given me hope for precomp again, so good on you for making those additions!

I will be looking forward to the next additions and will be on the lookout for that verification feature. I know you have a good idea of what you're doing; I'm just a script kiddie, but I'm always interested in data compression. Honestly, it's been my main source of interest for the past year or so. I'm trying to learn LZ77 and make my own encoder and decoder; even that is quite the project, but I'll still try to get there eventually (I only know Python, but it's still doable).
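For what it's worth, a toy LZ77-style round trip really is doable in a small amount of Python. This is purely an illustrative sketch (brute-force match search, (offset, length, next-byte) triples, no entropy coding), nothing like a production encoder:

```python
def lz77_encode(data: bytes, window: int = 255) -> list:
    """Naive LZ77: emit (offset, length, next_byte) triples using a
    brute-force longest-match search. Fine for learning, slow for real use."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Matches may overlap the current position (j + k can reach i),
            # which is what lets LZ77 encode runs like "aaaa" compactly.
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        nxt = data[i + best_len] if i + best_len < len(data) else None
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

def lz77_decode(tokens: list) -> bytes:
    out = bytearray()
    for off, length, nxt in tokens:
        for _ in range(length):
            out.append(out[-off])  # byte-by-byte copy handles overlap
        if nxt is not None:
            out.append(nxt)
    return bytes(out)

msg = b"abracadabra abracadabra"
assert lz77_decode(lz77_encode(msg)) == msg
```

The decoder copying one byte at a time is the classic trick: when offset < length, the copy source overlaps the bytes being written, and byte-wise copying makes that work.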

Anyway, you can either close this or do whatever you think is best, I don't mind. Again, thanks for your time and I hope to see precomp become the tool you envision!

@Merculous v0.4a now verifies the precompressed streams before outputting them.
So now situations like this should be much less likely.

Other bugs remain of course, which I will continue to fix, but v0.4a should address most of your concerns in terms of Precomp's general reliability.

Awesome! Thanks for the update. I think it's a good time to close. Cheers!