gabledata/recap

Validate CST translation to/from Avro, JSON schema, and Proto


This is an uber-ticket to track work on CST validation when converting between Recap schemas and other serdes (Avro, JSON schema, Proto), in both directions.

The idea is to define a test directory like:

  • /tests/converters/0000
    • from.avsc
    • from.proto
    • from.json
    • recap.yaml
  • /tests/converters/0001
    • to.avsc
    • to.proto
    • to.json
    • recap.yaml
  • /tests/converters/0002
    • to.avsc
    • to.proto
    • to.json
    • recap.yaml
    • from.avsc

The test harness would iterate over every test subdirectory and verify that running a serde's converter together with types.py's to_dict/from_dict produces the expected structures.

For example, in 0000, the test harness would load the from.avsc into a dict, then use AvroConverter.to_recap to convert the Avro schema to a Recap StructType (the Recap AST). It'd then use recap.types.to_dict to convert the Recap AST to the Recap CST (a dict). Finally, it'd load the recap.yaml and compare it to the Recap CST; the two should be identical. It'd then repeat the same steps for the .proto and .json files.

In the second directory example it'd do the inverse. In the third directory example (0002) it'd do both the to and from checks for the Avro schemas. The reason for splitting to/from is that you might not get the same schema converting to Recap as you'd get converting from Recap, since some conversions are lossy or require coercion.
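To make the 0000-style check (serde -> Recap) concrete, here's a minimal sketch of the harness loop. The converter callables here are stubs standing in for the real pipeline (e.g. AvroConverter.to_recap followed by recap.types.to_dict), and the CST payload is illustrative; the demo builds a throwaway test directory so the loop can run end to end:

```python
import json
import tempfile
from pathlib import Path


def check_case(case_dir: Path, converters: dict, load_expected) -> None:
    """Run the serde -> Recap check for one test directory.

    `converters` maps a from.* file name to a callable that turns the file's
    text into a Recap CST dict (in the real harness: the serde converter's
    to_recap plus recap.types.to_dict).
    """
    expected = load_expected(case_dir / "recap.yaml")
    for name, convert in converters.items():
        source = case_dir / name
        if not source.exists():
            continue  # a case may only cover some serdes
        actual = convert(source.read_text())
        assert actual == expected, f"{case_dir.name}/{name} mismatch"


# Demo: build a fake 0000 case with stub converters (identity on JSON).
with tempfile.TemporaryDirectory() as tmp:
    case = Path(tmp) / "0000"
    case.mkdir()
    cst = {"type": "struct", "fields": [{"type": "int32", "name": "id"}]}
    (case / "from.avsc").write_text(json.dumps(cst))
    (case / "recap.yaml").write_text(json.dumps(cst))  # JSON is valid YAML
    check_case(
        case,
        converters={"from.avsc": lambda text: json.loads(text)},
        load_expected=lambda p: json.loads(p.read_text()),
    )
    print("case 0000 ok")
```

The real harness would plug in the actual converters and a YAML loader; the shape of the loop stays the same.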

@mjperrone LMK if this makes sense.

I think building out a test suite like that will be very valuable. There are a lot of good ideas expressed in this. It was a little hard for me to parse.

I think you should put more explicit meaning into the names and directories of the tests, and group the tests semantically.

  • "from" vs "to" tests: I think using input, output, and expected in the file names inside the directory will make it clearer.
  • I think the directory should identify which conversion direction is happening.
  • Use .schema.json to differentiate JSON schema files from plain .json.
  • Instead of numbers, name tests based on what they're trying to test. I'm not so convinced on this one; it may be worth just putting a readme.md in each test directory to explain what each test is trying to cover.
    /tests/converters/into-recap/regex-constraints/
        input.avsc
        input.proto
        input.schema.json
        expected.recap.cst.yaml
    /tests/converters/out-of-recap/min-max-constraints-1/
        expected.avsc
        expected.proto
        expected.json
        recap.cst.yaml
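A layout like this lends itself to direction-keyed discovery in the harness. A small sketch (directory names follow the proposal above; the demo builds a throwaway layout so the function can run):

```python
import tempfile
from pathlib import Path


def discover_cases(root: Path) -> dict[str, list[str]]:
    """Map each conversion direction to the names of its test cases."""
    return {
        direction.name: sorted(c.name for c in direction.iterdir() if c.is_dir())
        for direction in sorted(root.iterdir())
        if direction.is_dir()
    }


# Demo: build the proposed layout in a temp dir and discover it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "into-recap" / "regex-constraints").mkdir(parents=True)
    (root / "out-of-recap" / "min-max-constraints-1").mkdir(parents=True)
    cases = discover_cases(root)
    print(cases)
```

The harness could then parametrize one test function per direction over the discovered case directories.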

The third test semantic I couldn't wrap my head around.

I like all your suggestions. I'll incorporate them.

Regarding the third test semantic, what I was trying to emphasize was that a single test could validate both avsc -> Recap and Recap -> avsc. The first two tests just validate one direction or the other. I don't think this is strictly required, since you could just have two tests, but I thought it might be more convenient.
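In other words, the harness could infer which direction(s) a test directory exercises from the files it contains. A small sketch using the original from.*/to.* naming (the helper name is hypothetical):

```python
import tempfile
from pathlib import Path


def planned_checks(case_dir: Path) -> list[str]:
    """List the checks a test dir requests, based on which files exist."""
    checks = []
    for serde in ("avsc", "proto", "json"):
        if (case_dir / f"from.{serde}").exists():
            checks.append(f"{serde} -> recap")  # e.g. AvroConverter.to_recap
        if (case_dir / f"to.{serde}").exists():
            checks.append(f"recap -> {serde}")
    return checks


# Demo mirroring directory 0002 above: both directions, Avro only.
with tempfile.TemporaryDirectory() as tmp:
    case = Path(tmp)
    for name in ("from.avsc", "to.avsc", "recap.yaml"):
        (case / name).touch()
    checks = planned_checks(case)
    print(checks)
```

A directory with only from.* files validates one direction, only to.* files the other, and both together round-trips in a single test.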