mjakubowski84/parquet4s

Support for map Parquet legacy format

smf-srogozins opened this issue · 4 comments

Hello, this is likely related to #184

I am using parquet4s 2.6.0, which as far as I understand uses parquet-mr 1.12.0, and I need to read some files that were generated with parquet-mr 1.11.0. The issue is that the files contain map fields, and apparently their name in the Parquet schema changed between these versions from map to key_value. The last version of parquet4s that used 1.11.0 is 1.7.0, which is a pretty big downgrade. I see something related to this in the Spark code as well:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L412-L437

Is there an existing workaround in parquet4s that I can use for this, or could support for the legacy map type be added as well?

Hi @smf-srogozins,
Parquet4s should be able to read both the current and the legacy format of maps. It ignores the name of the group (it doesn't care whether it is named map or key_value). The name is used only during writing. You can check it in the code: https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/ParquetRecord.scala#L682. If I missed something and backwards compatibility is broken, do not hesitate to propose a fix. PRs are warmly welcome :)

OK, that's weird. To double-check this, I tried generating a similar file with version 1.11.0 of parquet4s, and I am not seeing the error with that file. I need to spend more time to understand the cause, because the error specifically complains about key_value not found in optional group myMap (MAP). Not sure if that is useful, but this also only seems to occur when I am using projection. Also, I believe the failing Parquet files were generated using Apache Iceberg.

> Not sure if that is useful, but this also only seems to occur when I am using projection

Oh, yes, it is useful. When using projection, you define the exact schema you expect from the file you read. With Parquet 1.12, that projection schema contains key_value, which does not match the group name stored in your files.
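To illustrate the mismatch, here is a minimal sketch that uses only parquet-mr's schema classes (it assumes parquet-column is on the classpath, which it is when parquet4s is used). The field name myMap comes from the error message above; the key and value types and the object name are assumptions made up for the example, and the schema is simplified. It only shows that a column path containing key_value does not exist in a schema whose inner group is named map; it is not the code path parquet4s itself runs.

```scala
import org.apache.parquet.schema.{MessageType, MessageTypeParser}

// Hypothetical example object, not part of parquet4s.
object MapProjectionMismatch {

  // Simplified stand-in for the schema stored in a legacy file:
  // the repeated group inside the map is named "map".
  // Key and value types are assumed for illustration.
  val fileSchema: MessageType = MessageTypeParser.parseMessageType(
    """message file {
      |  optional group myMap (MAP) {
      |    repeated group map {
      |      required binary key (UTF8);
      |      optional int32 value;
      |    }
      |  }
      |}""".stripMargin
  )

  // Column path as the legacy writer laid it out.
  val legacyPath: Array[String] = Array("myMap", "map", "key")

  // Column path that a projection built with the newer naming asks for.
  val currentPath: Array[String] = Array("myMap", "key_value", "key")

  def main(args: Array[String]): Unit = {
    println(fileSchema.containsPath(legacyPath))  // true: the path exists in the file
    println(fileSchema.containsPath(currentPath)) // false: no group named key_value
  }
}
```

Without projection no such by-name lookup against a requested schema is needed, which is consistent with the error showing up only when projection is used.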
There are two ways to work around this: