Support for map Parquet legacy format

Question

smf-srogozins opened this issue 2 years ago · 4 comments

Hello, this is likely related to #184

I am using parquet4s 2.6.0, which as far as I understand uses parquet-mr 1.12.0, and I need to read some files that were generated with parquet-mr 1.11.0. The issue is that the files contain map fields, and apparently the name used for them in the Parquet schema has changed between versions from map to key_value. The last version of parquet4s that used 1.11.0 is 1.7.0, which is a pretty big downgrade. I see something related to this in the Spark code as well:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L412-L437

Is there an already present workaround in parquet4s I can use for this, or can the support for legacy map type be added as well?

Answer 1 · 2022-11-13T18:36:37.000Z

Hi @smf-srogozins,
Parquet4s should be able to read both the current and the legacy format of maps. It ignores the name of the group (it doesn't care if it is named map or key_value). The name is used only during writing. You can check it in the code: https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/ParquetRecord.scala#L682. If I missed something and backwards compatibility is broken, do not hesitate to propose a fix. PRs are warmly welcome :)
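To illustrate the point about reading, here is a minimal sketch in plain Scala. It does not use parquet4s or parquet-mr types: Entry, RepeatedGroup, MapGroup and decode are hypothetical names invented for this example. It only shows that a decoder which takes the single repeated child of a map group, whatever that child is called, produces the same result for both layouts.

```scala
// Toy stand-ins for a Parquet map group; these are NOT parquet4s or parquet-mr types.
final case class Entry(key: String, value: Int)
final case class RepeatedGroup(name: String, entries: List[Entry])
final case class MapGroup(name: String, child: RepeatedGroup)

object NameAgnosticRead {
  // Decode by position: use the only child of the map group and never inspect its name.
  def decode(group: MapGroup): Map[String, Int] =
    group.child.entries.map(e => e.key -> e.value).toMap

  val entries: List[Entry] = List(Entry("a", 1), Entry("b", 2))

  // Same data, two different names for the repeated group.
  val legacy: MapGroup  = MapGroup("myMap", RepeatedGroup("map", entries))
  val current: MapGroup = MapGroup("myMap", RepeatedGroup("key_value", entries))
}
```

Calling NameAgnosticRead.decode on legacy and on current gives equal maps, because the name of the repeated group is never consulted.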

Answer 2 · 2022-11-14T08:05:41.000Z

OK, that's weird. To double-check this, I tried generating a similar file with version 1.11.0 of parquet4s, and I am not seeing the error with that file. I need to spend more time to understand the cause, because the error specifically complains: key_value not found in optional group myMap (MAP). Not sure if that is useful, but this also only seems to occur when I am using projection. I also believe the failing Parquet files were generated using Apache Iceberg.

Answer 3 · 2022-11-14T08:23:17.000Z

> Not sure if that is useful, but this also only seems to occur when I am using projection

Oh, yes, it is useful. When using projection, you define the exact schema you expect from the file you read. And when using Parquet 1.12, the schema will contain key_value, which does not match your data.
There are two ways to work around this:

- do not use projection (which is not ideal, I guess)
- provide your own schema definition for a Map; that is, you have to implement your own type class, like this one: https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/Schema.scala#L308 but using your own field names.
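
The mismatch described above can be sketched in plain Scala. This is a simplified model under stated assumptions, not parquet4s or parquet-mr code: Node, Leaf, Group, firstMissing and mapField are hypothetical names, and the message text only imitates the error reported in this thread. The model assumes a projection is resolved by field name against the file schema, so a projection that asks for key_value fails on a file whose repeated group is named map, while a projection built with the file's own names passes.

```scala
// Toy schema tree; these are NOT parquet4s or parquet-mr types.
sealed trait Node { def name: String }
final case class Leaf(name: String) extends Node
final case class Group(name: String, fields: List[Node]) extends Node

object ProjectionSketch {
  // Name-based check: every field requested by the projection must exist,
  // under the same name, at the same place in the file schema.
  // Returns a description of the first missing field, if any.
  def firstMissing(file: Node, projection: Node): Option[String] =
    (file, projection) match {
      case (Group(fileName, fileFields), Group(_, wanted)) =>
        wanted.flatMap { w =>
          fileFields.find(_.name == w.name) match {
            case Some(f) => firstMissing(f, w)
            case None    => Some(s"${w.name} not found in group $fileName")
          }
        }.headOption
      case _ => None
    }

  // A map field whose repeated inner group carries the given name.
  def mapField(repeatedGroupName: String): Group =
    Group("myMap", List(Group(repeatedGroupName, List(Leaf("key"), Leaf("value")))))

  val legacyFile: Group        = mapField("map")       // layout assumed for the failing files
  val currentProjection: Group = mapField("key_value") // layout assumed for the default projection
  val customProjection: Group  = mapField("map")       // projection using the file's own names
}
```

In this model, firstMissing(legacyFile, currentProjection) reports the missing key_value group, and firstMissing(legacyFile, customProjection) finds nothing missing. In parquet4s itself, the second case corresponds to supplying your own schema type class for the Map with the legacy field names, as suggested above.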

Answer 4 · 2022-11-14T09:34:43.000Z

Hint: you can use https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/Schema.scala#L82 to build your own schema def easily.