mirage/mrmime

Bigstring on rosetta package

Currently, rosetta works on Bytes.t. It translates from an encoding to UTF-8, and we chose this kind of buffer mostly because uutf works on Bytes.t. However, angstrom works on bigstring and, on the other side, fe (the internal encoder of mrmime) works with both.

So, because rosetta is under my responsibility, I can decide to provide a translation from a bigstring input. But the code will change a lot, and the internals will change as well.

From my point of view, and mostly because I did a lot of benchmarks with buffet, we come back to the same big question: should we enforce the use of Bytes.t, or of Bigstring.t, or functorize the code, or use a (G)ADT over the input? From the benchmarks, the functor is the best, and flambda will be able to optimize it easily (by specializing the functor).

So we have different plans:

  • (middle) functorize rosetta (and pecu, and uuuu, and coin, and yuscii).

This solution moves the boilerplate into rosetta; we can then apply the functor in mrmime and use only bigstring. From this point, we avoid most of the copies when we translate an input from an encoding to UTF-8. However, we still have copies in the uutf part (which uses only Bytes.t).

From the benchmarks, it's the best solution, even if I don't like functorizing everything. flambda will then be able to optimize it, and the code stays readable, unlike the second solution, which needs to pass a witness to every function that manipulates the input.
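To make the functor option concrete, here is a minimal sketch. It is not rosetta's actual code: the `BUFFER` signature, the `Make` functor and the toy `latin1_to_utf8` decoder are hypothetical names invented for illustration, and only the standard library is used.

```ocaml
(* Hypothetical signature: the minimum a decoder needs from its input. *)
module type BUFFER = sig
  type t
  val length : t -> int
  val get : t -> int -> char
end

type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

module Bytes_buffer : BUFFER with type t = Bytes.t = struct
  type t = Bytes.t
  let length = Bytes.length
  let get = Bytes.get
end

module Bigstring_buffer : BUFFER with type t = bigstring = struct
  type t = bigstring
  let length = Bigarray.Array1.dim
  let get = Bigarray.Array1.get
end

(* Toy decoder (an assumption, not rosetta's): ISO-8859-1 to UTF-8.
   The decoding logic is written once, against the abstract buffer. *)
module Make (B : BUFFER) = struct
  let latin1_to_utf8 (src : B.t) : string =
    let out = Buffer.create (B.length src) in
    for i = 0 to B.length src - 1 do
      Buffer.add_utf_8_uchar out (Uchar.of_char (B.get src i))
    done;
    Buffer.contents out
end

(* The functor is applied once per buffer kind, e.g. in mrmime. *)
module Of_bytes = Make (Bytes_buffer)
module Of_bigstring = Make (Bigstring_buffer)

(* Helper for the example only: build a bigstring from a string. *)
let bigstring_of_string (s : string) : bigstring =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b
```

With this shape, the boilerplate lives in the two small buffer modules, and a user such as mrmime would only ever mention `Of_bigstring`.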

  • (middle) (G)ADT - (decompress's solution)

This avoids the functor but adds an argument, the witness, to every function that manipulates the input. flambda is not really able to optimize it, and specialization (even if we use a GADT) is hard.
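For comparison, a minimal sketch of the witness style. This is only an illustration of the idea, not decompress's real API: the `witness` type, its constructors and the toy `latin1_to_utf8` decoder are hypothetical.

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Hypothetical witness: a value that tells us which buffer kind ['a] is. *)
type 'a witness =
  | Bytes_w : Bytes.t witness
  | Bigstring_w : bigstring witness

(* Every primitive takes the witness and dispatches on it at run time. *)
let length : type a. a witness -> a -> int =
 fun w buf ->
  match w with
  | Bytes_w -> Bytes.length buf
  | Bigstring_w -> Bigarray.Array1.dim buf

let get : type a. a witness -> a -> int -> char =
 fun w buf i ->
  match w with
  | Bytes_w -> Bytes.get buf i
  | Bigstring_w -> Bigarray.Array1.get buf i

(* Toy decoder (an assumption): ISO-8859-1 to UTF-8. Note how the
   witness has to be threaded through every call that touches the input. *)
let latin1_to_utf8 : type a. a witness -> a -> string =
 fun w src ->
  let out = Buffer.create (length w src) in
  for i = 0 to length w src - 1 do
    Buffer.add_utf_8_uchar out (Uchar.of_char (get w src i))
  done;
  Buffer.contents out

(* Helper for the example only: build a bigstring from a string. *)
let bigstring_of_string (s : string) : bigstring =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b
```

A call then looks like `latin1_to_utf8 Bytes_w some_bytes` or `latin1_to_utf8 Bigstring_w some_bigstring`; there is no functor application, but the `match` on the witness sits inside the inner loop unless the compiler manages to specialize it.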

  • move to bigstring (angstrom's solution)

Following angstrom, which uses a bigstring, we can move to this solution and enforce the use of bigstring only in rosetta (and likewise in the related packages). However, we lose the ability to use Bytes.t in some cases. But from a performance perspective, this is the best choice.
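What losing Bytes.t means in practice can be sketched as follows: a caller who only has a Bytes.t would first have to copy it into a bigstring. The helper name `bigstring_of_bytes` is hypothetical and uses only the standard library.

```ocaml
type bigstring =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Hypothetical helper: the copy a Bytes.t user pays in a
   bigstring-only API. The result does not share memory with [src]. *)
let bigstring_of_bytes (src : Bytes.t) : bigstring =
  let len = Bytes.length src in
  let dst = Bigarray.Array1.create Bigarray.char Bigarray.c_layout len in
  for i = 0 to len - 1 do
    Bigarray.Array1.set dst i (Bytes.get src i)
  done;
  dst
```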

In my opinion, the first option should be the best but ... eh, another functor, and after my story with ocaml-git I'm a little bit sick of them. Anyway, I'm leaving this issue here because the question is still open.

avsm commented