iconnect/api-tools

Switch to released version of serialise

Closed this issue · 17 comments

Now that serialise (formerly binary-serialise-cbor) has been released to Hackage, we should make use of it and publish a new api-tools release. This requires a little care: the new library should be backwards-compatible (i.e. new code ought to read old data, though this needs testing) but is probably not forwards-compatible (i.e. old code may not be able to read new data).

Fire when ready!

August is of course the best time to do this.

So far I'm having trouble even compiling with the released binary-serialise-cbor. Where is the source code (or a tag or commit hash) of the private version of cbor that was used for api-tools? I could compare it and/or use it for the GHC 8.0.2 port based on the public Hackage release, until api-tools is ported.

@Mikolaj I wasn't involved in the sharp end of this — I expect @adamgundry will explain in the morning.

@Mikolaj we previously used well-typed/cborg@05a7ebf (modulo the version number):

api-tools/.travis.yml

Lines 23 to 25 in 1e59a89

- git clone https://github.com/well-typed/binary-serialise-cbor.git
- cd binary-serialise-cbor
- git checkout -b 0.1.1.0.1 05a7ebf8a2e1af0feb61bc85b828a712e95762e8

@Mikolaj how would you feel about tackling the port to cborg/serialise? I'm moderately hopeful that it won't be too much work, and would let us tidy all this up.

@adamgundry: thank you. Wow, the commit is from over a year and a half ago. Yes, I'd like to port api-tools to the new version. Will report here.

A report of the WIP: I've managed to make it compile in the laziest way possible. In particular, I haven't yet used the new JSON<->CBOR code to get rid of the old api-tools code that does similar things (the APIs seem different and I haven't yet identified which parts are functionally equivalent).

All tests pass except for 3 tests called MyUTC:

        MyUTC:                              FAIL
          *** Failed! (after 1 test): 
          Post-processed: Object (fromList [("1",Number 1.328197511e9),("-12",Number 0.0)])
          Direct JSON:    String "2012-02-02T15:45:11Z"
          2012-02-02 15:45:11 UTC
        MyUTC:                              FAIL
          *** Failed! (after 1 test): 
          Failed with JSON:      String "2013-05-27T19:13:50Z"
          Generic serialisation: "\192t2013-05-27T19:13:50Z"
          Direct serialisation:  "\217\ETX\232\162\SOH\SUBQ\163\176n+\NUL"
          2013-05-27 19:13:50 UTC
        MyUTC:                              FAIL
          *** Failed! Decode error: decodeString: unexpected token TkMapLen 2 (after 1 test): 
          2013-05-27 19:13:50 UTC

I've recompiled serialise-0.1.0.0 with flag newtime15, but no difference. Will dig into that after lunch. Hints welcome. :)

Report continued: I've figured out that at least the last error is caused by the new way cborg now serialises UTCTime (not as a String), introduced on 2017-02-13 in this commit:

Improve UTCTime encoding

The encodings given by the specification are either inefficient (an ISO8601
string) or imprecise (offset from the epoch in seconds). Here we implement the
extended time encoding currently under consideration as an IETF working draft.

This is done via class instances, so for now I intend to track down all uses of UTCTime in the tests (and in the main code if needed, let's hope not) and convert them to String using the function that was previously used to serialise UTCTime in cborg:

formatUTCrfc3339 = formatTime defaultTimeLocale "%Y-%m-%dT%H:%M:%S%QZ"
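For reference, this is roughly what the old cborg instance did with that formatter (my sketch using the new module names and assuming the standard tag 0 for date/time strings, not the actual api-tools code):

    import Codec.CBOR.Encoding (Encoding, encodeString, encodeTag)
    import Data.Monoid ((<>))
    import qualified Data.Text as Text
    import Data.Time (UTCTime, defaultTimeLocale, formatTime)

    -- The formatter quoted above, with its assumed signature.
    formatUTCrfc3339 :: UTCTime -> String
    formatUTCrfc3339 = formatTime defaultTimeLocale "%Y-%m-%dT%H:%M:%S%QZ"

    -- Old-style encoding: standard tag 0 followed by the RFC3339 string.
    encodeUTCTimeAsString :: UTCTime -> Encoding
    encodeUTCTimeAsString t =
      encodeTag 0 <> encodeString (Text.pack (formatUTCrfc3339 t))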

In the future we may want to use the new UTCTime serialisation format, but I expect this is not entirely trivial, due to the need to keep backward compatibility. Edit: also, we tweak the old String serialisation somewhat because of old aeson behaviour, and I don't know if I can drop those hacks.

Ah, UTC format serialisation issues, those bring back (bad) memories...

It would be worthwhile, if possible, to move to the new serialisation format for UTCTime, because we have a lot of timestamps and (de)serialising them more quickly should yield a nontrivial performance win; this is a significant part of the appeal of switching to serialise. I hope that we can do so while maintaining backwards compatibility by supporting deserialisation of the old encoding (but losing forwards compatibility). The main places I see that need to change are in Data.API.JSONToCBOR: jsonToCBORBasic needs to convert to the new encoding, and I suspect postprocessJSONType will need some work as well. We already have to do some work to deal with mismatches between the JSON and CBOR formats, so hopefully a bit more won't be too terrible.
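To sketch the first of those changes, here is a hypothetical helper (not the real jsonToCBORBasic) that takes the time-as-string from the JSON side and builds a CBOR Term in the new shape, which the failing test output above suggests is tag 1000 wrapping a map {1: epoch seconds, -12: picoseconds}:

    import Codec.CBOR.Term (Term (..))
    import Data.Time (UTCTime, defaultTimeLocale, parseTimeM)
    import Data.Time.Clock.POSIX (utcTimeToPOSIXSeconds)

    timeStringToTerm :: String -> Maybe Term
    timeStringToTerm s = do
      t <- parseTimeM True defaultTimeLocale "%Y-%m-%dT%H:%M:%S%QZ" s :: Maybe UTCTime
      let secs = floor (utcTimeToPOSIXSeconds t) :: Integer
      -- Sub-second precision is dropped here for brevity; the real code would
      -- have to fill in the -12 entry properly.
      pure (TTagged 1000 (TMap [(TInt 1, TInteger secs), (TInt (-12), TInt 0)]))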

@adamgundry: for now, I'm aiming low, just to make it work. Unfortunately that involves the main code, not just the tests. We already have a manually written TH encoder :: NormAPI -> APIType -> ExpQ, but no decoder was needed, because the previous CBOR format was fortunately compatible with our manual serialisation of UTCTime: we were truncating time in the encoder, similarly to aeson, and that was enough. Now I've manually written a CBOR instance for UTCTime that matches the old behaviour of cborg, but that is not enough: I also need a complete corresponding decoder to embed in the TH code, because I'm changing its UTCTime case and so I have to spell out all the rest of the pipework as well. At least I hope to avoid any migration between formats at this stage. The next step will be using the new CBOR format, but that's after I close the atlast+hermes work.

@Mikolaj on the decoder side, doesn't cborg still support decoding the old format anyway? I see we need to change our encodeUTCTime when we change the format, but I think the normal CBOR decoder should support both formats. Or am I missing something?

@adamgundry: you are right, the decoder tells which format is used from the initial tag:

https://github.com/well-typed/cborg/blob/master/serialise/src/Codec/Serialise/Class.hs#L1326

and the version previously used in api-tools did use that tag:

https://github.com/well-typed/cborg/blob/05a7ebf8a2e1af0feb61bc85b828a712e95762e8/Data/Binary/Serialise/CBOR/Class.hs#L235

as well as in the manual encoding in api-tools. That looks like perfect backward compatibility, so it seems I don't need to write the decoder manually after all. \o/
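To make sure I understand the mechanism, here is my own simplified sketch of that dispatch (not the real Codec.Serialise.Class code; the new-format branch assumes the map keys arrive in the order 1 then -12, with -12 carrying picoseconds, as the test output above suggests):

    import Codec.CBOR.Decoding
      (Decoder, decodeInt, decodeInteger, decodeMapLen, decodeString, decodeTag)
    import qualified Data.Text as Text
    import Data.Time (UTCTime, defaultTimeLocale, parseTimeM)
    import Data.Time.Clock.POSIX (posixSecondsToUTCTime)

    decodeUTCTimeBoth :: Decoder s UTCTime
    decodeUTCTimeBoth = do
      tag <- decodeTag
      case tag of
        0 -> do            -- old format: tag 0 over an RFC3339 string
          s <- decodeString
          maybe (fail "malformed time string") pure $
            parseTimeM True defaultTimeLocale "%Y-%m-%dT%H:%M:%S%QZ" (Text.unpack s)
        1000 -> do         -- new format: tag 1000 over {1: seconds, -12: picoseconds}
          _len  <- decodeMapLen
          _k1   <- decodeInt
          secs  <- decodeInteger
          _k2   <- decodeInt
          psecs <- decodeInteger
          pure $ posixSecondsToUTCTime
                   (fromInteger secs + fromInteger psecs * 1e-12)
        _ -> fail ("unexpected tag for UTCTime: " ++ show tag)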

Some of the test failures do seem to be caused by a mix-up of the encodings, but this may be a shallower problem. Will diagnose again today.

@adamgundry: it seems the current api-tools implementation insists on (and assumes) serialising UTCTime as a String, because then it can quickly decode into the Value type (and JSON?):

https://github.com/iconnect/api-tools/blob/master/src/Data/API/Value.hs#L88

https://github.com/iconnect/api-tools/blob/master/src/Data/API/Value.hs#L257

That's where (some of) the test failures come from when using the new cborg: in some cases the tests employ the normal cborg serialisation, not our custom one, so the encoding assumption doesn't hold; elsewhere the tests probably also look at the CBOR AST and expect a particular form of the CBOR terms.

I could improve the code by analysing the tag and converting from the UTCTime type to String, but perhaps the old setup is crucial for efficiency? It stores UTCTime as strings in all types (JSON too?) and serialises it as strings as well. We could change Value to store time as UTCTime, but it probably needs to match JSON. I just don't know enough of the context and rationale behind these decisions, so help would be much appreciated. :)

Let me try to give some background and rationale for what is going on here...

Our Data.API.Value:Value type provides a generic data representation for doing migrations. The original migration code used Aeson's Value type, but that was unnecessarily expensive for our purposes. We still support converting back and forth between the representations, and performing migrations using the Aeson representation, for backwards compatibility and testing (including tests that the various encodings agree).

(Note that all this is separate from what happens when loading data normally, rather than migrating. In that case, we use the Serialise or FromJSON[WithErrs] instances generated for individual datatypes by TH. Of course, it's important that the generic and TH-generated encodings agree.)

It's important that CBOR encoding/decoding our Value should be fast. It doesn't matter if using Aeson's Value type, or JSON encoding/decoding our Value, are slow, because we don't typically use those operations for production data.

Since both JSON and CBOR originally used a consistent string representation of UTCTime, our generic value representation could leave the string alone (avoiding parsing it, because this is expensive, and we don't need the actual time values decoded when doing migrations). Now that CBOR uses a faster representation, I think we should make the UTCTime constructor of Value actually contain a UTCTime and encode/decode it normally, paying the cost of time parsing/formatting only when going via JSON.
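Purely as an illustration of that last point (made-up types, not the real Data.API.Value constructors):

    import Data.Text (Text)
    import Data.Time (UTCTime)

    -- Before: the time travels as its unparsed RFC3339 string, because JSON and
    -- CBOR agreed on that representation.
    newtype TimeFieldOld = TimeFieldOld Text

    -- After: hold a real UTCTime; CBOR uses the stock Serialise instance, and the
    -- string parsing/formatting cost is only paid on the JSON path.
    newtype TimeFieldNew = TimeFieldNew UTCTime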

Clear as mud?

Thank you, that helps a lot. So, by switching everywhere to the new CBOR encoding, I can fix the test failures (which expect consistent encoding) and, in some cases, simplify the code as well.

I've done the basic transition and right now I have 10 times more test failures :) but that's only because the old encoding is hardwired in a few extra places, probably only for testing or perhaps also for migration (I really hope I can put the new encoding there too, or at least that the tests will catch it if not), e.g.

https://github.com/iconnect/api-tools/blob/master/src/Data/API/JSONToCBOR.hs#L102

Sounds good. Yes, the JSONToCBOR code will need updates: converting the time-as-string to the new CBOR representation in jsonToCBORBasic, as you've spotted, and I think postprocessJSONType may also need adapting, to convert whatever cborToJson produces for the new time encoding into the string representation.
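For the postprocessing side, something along these lines might do (a hypothetical helper, not the real postprocessJSONType, assuming an aeson where Object is a HashMap Text Value): turn the {"1": seconds, "-12": picoseconds} object that cborToJson appears to produce back into the RFC3339 string.

    import Data.Aeson (Value (..))
    import qualified Data.HashMap.Strict as HM
    import qualified Data.Text as Text
    import Data.Time (defaultTimeLocale, formatTime)
    import Data.Time.Clock.POSIX (posixSecondsToUTCTime)

    postprocessTime :: Value -> Maybe Value
    postprocessTime (Object o) = do
      Number secs <- HM.lookup (Text.pack "1") o
      -- The "-12" (picosecond) entry is ignored here for brevity.
      let t = posixSecondsToUTCTime (realToFrac secs)
      pure (String (Text.pack (formatTime defaultTimeLocale "%Y-%m-%dT%H:%M:%S%QZ" t)))
    postprocessTime _ = Nothing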

Yay, the two fixes you mentioned brought the test failure count from 23 to 2: Mikolaj@40a836f

BTW, I will request a review when I'm done, because there is too much guessing on my part (e.g., how exactly CBOR represents the time as AST terms). Also, the error messages could probably be more accurate, and perhaps I'm not handling all possible JSON inputs (order of fields, etc.).