johnwhitington/camlpdf

Clash with "official" zlib/camlzip

alainfrisch opened this issue · 13 comments

camlpdf comes with a custom copy of camlzip. This raises a few questions:

  • Are there changes with the upstream version, except for the use of miniz instead of zlib?

  • What's the rationale for using miniz instead of zlib? Licensing issues, packaging issues ("self-contained-ness") or technical reasons?

  • There is a risk of symbol clashes for an application which links with both camlpdf and camlzip. I think this is ok for C symbols between zlib and miniz, since the rebinding from e.g. mz_compress to compress is done through the C preprocessor. But there seem to be clashes for C symbols between flatestubs.c and zlibstubs.c. There is a risk that zlib by camlpdf or miniz used by camlzip. This could be addressed by using different symbol names in flatestubs.c.

  • There is also an issue with the registered exception slot "Zlib.Error" (same name used in both libraries). I've found my program raising Pdfflate.Error("Zlib.deflateInit", "") when the exception was actually raised by the official Zlib and unrelated to the use of camlpdf. This can be solved just by using a different name (in flatestubs.c and pdfflate.ml).

  • For people who need to use camlzip+zlib anyway, do you foresee any problem if changing Pdfflate locally to reuse camlzip?

  • In terms of packaging, I can see that having something self-contained is particularly important, even in these OPAM days, because of Windows. But perhaps the build system could be adapted to make it easy to use camlzip+zlib?

I am on holiday, so these are preliminary responses mostly from memory. We should fix this, though. In order:

  1. No changes with upstream.
  2. Self-containedness.
  3. Yes, we can change the symbols. No problem. In fact, we could remove all C stubs which we don't use too. CamlZip doesn't change often.
  4. Again, we can change this error. We aren't supposed to match on error strings these days anyway, so I don't think it will affect anyone.
  5. No, should be fine. zlib was used previously.
  6. The self-containedness is not really about Windows, since I supply binaries for that, and OPAM sort-of-works on Windows now. The issue is customers with obscure things like HPUX / AIX etc. I need to send them source, and the instructions need to be as simple as "install ocaml toolchain. build cpdf". These customers are few and far between, but customers using odd hardware tend to be customers with lots of money...

I think this should be fixed for the next release. I am happy to do everything myself except for the last bullet point (because I don't really understand makefiles well enough). Would you like to propose a patch which adapts the build system for camlpdf+zlib as you suggest in your last bullet point?

Thanks for your reply. I have switched to using zlib internally (just replace the content of pdfflate.ml with include Zlib and not linking the correponding .c file) and confirm everything is ok.

For 4, note that this not about the error string, but the exception constructor.

I'm not very familiar with Makefiles either (we integrate camlpdf in our monorepo controlled by our global omake -- we don't use the shipped Makefile), but I'll try to have a look.

What about even going further and use https://github.com/mirage/decompress? Using it will eliminate the need of linking with C stubs.

This possibility is already being tracked here: #26

From some of the comments here about PDF-specific problems:

#19

It looks like we might end up having to customize (i.e fork) CamlZip anyway. In which case the namespace issues can be fixed, for example. I'll look at this again for the release after the one which I am preparing.

For future reference: the same problems exists with the vendored AES implementation (the original rijndael implementation + the OCaml stubs from https://github.com/xavierleroy/cryptokit). Whatever solution is put in place for zlib (renaming C symbols, or allowing to use an external package) should probably be done for that one as well.

The following two commits (in order) should address these issues, for zlib and encryption, by prefixing the C stubs and OCaml function names, and by changing the name of the exception.

93397cd

faaae12

For miniz.h, I invoked a # DEFINE which stops export of the zlib names, so it won't link if I don't use miniz-specific names.

I'm currently running tests to make sure nothing has been broken by this, but if you could take a quick look at the diffs to let me know if you think I'm on the right track that would be great -- I know little about C or linking!

Thanks for looking into this!

I'm unsure about the direction. I think OCaml developers would prefer if camlpdf relied on upstream camlzip (which is itself self-contained) and libzip, and the renaming goes against this goal. It's common these days to have dependencies between OCaml packages; and even on Windows, this should be ok.

You mentioned customers of cpdf (the application), which just want to get it compiled and don't want to develop with camlpdf (as an OCaml library) themselves, IIUC. Could the self-containedness be addressed at the level of the cpdf packaging for your customers (i.e. shipping everything in one package)? I think it makes sense to vendor dependencies as one possible way to ship an application (esp. if it is not targeted specifically to the "OCaml" community); not so much for a library.

For AES, it's a bit less satisfactory, because cryptokit is monolithic, and brings further dependencies (Zarith), which are not needed for the parts used by camlpdf. It seems the best solution would be that cryptokit would be split into smaller parts, to avoid dragging unneeded dependencies, but this is not under you control! Ping @xavierleroy, just in case.

Thanks.

I'm unsure about the direction. I think OCaml developers would prefer if camlpdf relied on upstream camlzip (which is itself self-contained) and libzip, and the renaming goes against this goal. It's common these days to have dependencies between OCaml packages; and even on Windows, this should be ok.

I have already forked CamlZip, because of changes I needed to make to do with the treatment of malformed PDF files. I don't believe these patches would be appropriate upstream in CamlZip, since they would hinder error recovery in normal (non-PDF) situations. Though its possible these patches wouldn't be needed with ZLib, and that it is just miniz.c requiring them:

camlpdf/flatestubs.c

Lines 124 to 127 in faaae12

/* CamlZIP now treats Z_BUF_ERROR as non-fatal. However, this can lead to lack
* of progress on some malformed streams (at least with miniz.c -- maybe not
* zlib. So we have this hack. */
int camlpdf_buf_error_count;

Longer term, I think we should have an alternative release of CamlPDF for opam which relies on standard CamlZip/Zlib. This might go together with other changes such as using dune as the build system per #28

About the patch for malformed streams : either the patch is not needed with zlib, which would suggests that relying on official versions (camlzip+zlib) is indeed better; or it is needed, I would suggest that upstreaming (perhaps some variants of) the patch to camlzip might be useful to others (if the broken input stream causes a segfault or hanging process with camlzip+zlib, I'd say there is a bug in camlzip or zlib); perhaps with some explicit control over how errors are supported. If camlpdf requires some special support, perhaps other projects that need to parse possibly malformed streams would need it as well.

@johnwhitington reported the problem with ill-formed ZIP streams to me a while ago (xavierleroy/camlzip@20a0ae2#commitcomment-45177837), and I agree it would be nice to handle it inside camlzip, but I'm still not sure what to do: retry the operation on error but only a small number of times? I have no idea how Zlib and its alternate implementations really mean when they return a Z_BUF_ERROR code...

Concerning Cryptokit: I agree it's too big a dependency if you only need AES. Not sure I want to invest much effort in splitting Cryptokit in small pieces. There are more modern crypto libraries for OCaml (e.g. hacl-star and mirage-crypto) that look more active.

One more general comment: there is a balance to be found between a self-contained library or application that builds reliably all the time but doesn't play nice with other libraries, and a minimal library that depends on many other libraries and has to work around their bugs. As an example of the latter, many NPM Javascript packages provide just one function, and I strongly doubt this is an example we should follow.

Thanks for the pointers @xavierleroy.

I have just released v2.4 with these naming changes, but I'm going to leave this bug open for now.