spotify/scio

Better projection support for Parquet SMB reads

clairemcginty opened this issue · 0 comments

parquet-avro supports Schema projections that exclude required fields. However, if a required field is excluded, the resulting Avro record will fail the Coder round trip in the next PTransform.

As a workaround, scio-parquet provides a custom map API that is applied to the Parquet record immediately, before it undergoes Coder serialization, so that you can map the record to a serializable type. In SortMergeTransform, you can do the same with a custom via() function.

However, there's no support for such a projection function for regular Parquet SMB CoGroups/GroupByKeys. We'd have to add support for a SerializableFunction inside MultiSourceKeyGroupReader.
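
To make the proposal concrete, here is a rough, library-free sketch of the idea: a projection function is applied to each value as the reader hands it out, so only the mapped, serializable type ever reaches a Coder. Everything in it is hypothetical and for illustration only. `ProjectionFn` stands in for Beam's SerializableFunction, `PartialRecord` stands in for a projected Avro record whose excluded field is null, and `project` is not the actual MultiSourceKeyGroupReader code.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

final class ProjectedKeyGroupSketch {

    /** Hypothetical stand-in for Beam's SerializableFunction. */
    interface ProjectionFn<InT, OutT> extends Function<InT, OutT>, Serializable {}

    /** Toy stand-in for a projected record: `name` was excluded, so it is null. */
    static final class PartialRecord {
        final Integer id;
        final String name;

        PartialRecord(Integer id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /**
     * Hypothetical hook: wraps the values of one key group so that each record
     * is mapped lazily, before any downstream step could try to encode it.
     */
    static <InT, OutT> Iterator<OutT> project(
            final Iterator<InT> values, final ProjectionFn<InT, OutT> fn) {
        return new Iterator<OutT>() {
            @Override
            public boolean hasNext() {
                return values.hasNext();
            }

            @Override
            public OutT next() {
                return fn.apply(values.next());
            }
        };
    }

    /** Maps a key group of partial records down to their ids. */
    static List<Integer> demo() {
        List<PartialRecord> keyGroup =
                Arrays.asList(new PartialRecord(1, null), new PartialRecord(2, null));
        ProjectionFn<PartialRecord, Integer> toId = r -> r.id;

        List<Integer> out = new ArrayList<>();
        Iterator<Integer> projected = project(keyGroup.iterator(), toId);
        while (projected.hasNext()) {
            out.add(projected.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the partial records never leave the iterator, which is the property the CoGroup/GroupByKey path would need: the user-supplied function runs inside the reader, and only its output type needs a working Coder.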