spotify/magnolify

Parquet TODO

nevillelyh opened this issue · 2 comments

  • Avro array support in AvroWriteSupport - old TwoLevelListWriter vs new ThreeLevelListWriter
  • Avro nullable arrays and arrays of nullables
  • Fix parquet.avro.data.supplier with generic records in test #278
  • Schema compatibility check in ReadSupport 2aea4e8
  • Schema evolution for enums #290
  • Schema evolution for arrays 6c00ecb

Turns out the new 3-level list is more complex than the old 2-level one.

With the default 2 level list, myField: List[T] is written as:

required group myField (LIST) {
  repeated T array;
}

But the Avro counterpart is still "name": "myField", "type": "array", "items": T

While with 3 level list, the Parquet schema becomes:

required group myField (LIST) {
  repeated group list {
    required T element;
  }
}

And the Avro record becomes [{"element": t1}, {"element": t2}, ...]
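
To make the difference in record shape concrete, here is a rough Python sketch (not magnolify or parquet-avro code). The helper names `to_three_level` and `from_three_level` are made up for illustration; the only thing taken from the schema above is the `element` wrapper field name.

```python
# Hypothetical helpers illustrating the record shapes only; they are not
# part of magnolify or parquet-avro.

def to_three_level(values):
    """Wrap each value in a single-field record, as the 3-level layout expects."""
    return [{"element": v} for v in values]


def from_three_level(records):
    """Unwrap 3-level records back to the plain list a 2-level reader would see."""
    return [r["element"] for r in records]


plain = ["a", "b"]                # 2-level: values sit directly in the repeated field
wrapped = to_three_level(plain)   # 3-level: one {"element": ...} record per value
print(wrapped)
print(from_three_level(wrapped) == plain)
```

So any converter built on the 3-level layout has to add and strip this extra wrapper on every element, which is where the additional complexity comes from.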

WIP in https://github.com/spotify/magnolify/tree/neville/pq-avro

More on Avro array mapping. The following Avro fields

{"name": "field1", "type": {"type": "array", "items": "string"}, "default": [] } // required array field that defaults to empty array
{"name": "field2", "type": ["null", {"type": "array", "items": "string"}], "default": null } // nullable array field that defaults to null

map to:

required group field1 (LIST) {
  repeated binary array (STRING);
}
optional group field2 (LIST) {
  repeated binary array (STRING);
}
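
The mapping above can be sketched in a few lines of Python. This is only an illustration of the rule (a plain array becomes a `required` group, a `["null", array]` union becomes an `optional` group), assuming the 2-level layout; `avro_array_field_to_parquet` and the `PRIMITIVES` table are hypothetical names, and only the `string` row comes from this thread.

```python
import json

# Avro primitive -> (Parquet physical type, logical annotation or None).
# Only "string" appears in the thread; the other rows are added for illustration.
PRIMITIVES = {
    "string": ("binary", "STRING"),
    "int": ("int32", None),
    "long": ("int64", None),
    "double": ("double", None),
    "boolean": ("boolean", None),
}


def avro_array_field_to_parquet(field):
    """Render a 2-level Parquet LIST group for an Avro array field (hypothetical helper)."""
    avro_type = field["type"]
    repetition = "required"
    # A ["null", array] union means the whole list may be absent: optional group.
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1 or len(non_null) == len(avro_type):
            raise ValueError("only [null, array] unions are handled in this sketch")
        avro_type = non_null[0]
        repetition = "optional"
    if not isinstance(avro_type, dict) or avro_type.get("type") != "array":
        raise ValueError("not an array field")
    physical, logical = PRIMITIVES[avro_type["items"]]
    annotation = f" ({logical})" if logical else ""
    return (
        f"{repetition} group {field['name']} (LIST) {{\n"
        f"  repeated {physical} array{annotation};\n"
        f"}}"
    )


field1 = json.loads('{"name": "field1", "type": {"type": "array", "items": "string"}, "default": []}')
field2 = json.loads('{"name": "field2", "type": ["null", {"type": "array", "items": "string"}], "default": null}')
print(avro_array_field_to_parquet(field1))
print(avro_array_field_to_parquet(field2))
```

Note that in this layout only the list itself can be marked optional; the `repeated` element has no slot for nullability, so arrays of nullables are not expressible here.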