Support for map Parquet legacy format
smf-srogozins opened this issue · 4 comments
Hello, this is likely related to #184
I am using parquet4s 2.6.0, which as far as I understand uses parquet-mr 1.12.0, and I need to read some files that were generated with parquet-mr 1.11.0. The issue is that the files contain map fields, and apparently the logical name for them in the Parquet schema has changed between versions from `map` to `key_value`. The last version of parquet4s that used 1.11.0 is 1.7.0, which is a pretty big downgrade. I see something related to it in the Spark code as well:
Is there an already present workaround in parquet4s I can use for this, or can the support for legacy map type be added as well?
Hi @smf-srogozins,
Parquet4s should be able to read both the current and the legacy format of maps. It ignores the name of the group (it doesn't care if it is named `map` or `key_value`). The name is used only during writing. You can check it in the code: https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/ParquetRecord.scala#L682. If I missed something and backwards compatibility is broken, do not hesitate to propose a fix. PRs are warmly welcome :)
OK, that's weird. To double check this, I tried generating a similar file with version 1.11.0 of parquet4s and I am not seeing the error with that file. I need to spend more time to understand the cause, because the error specifically complains about `key_value not found in optional group myMap (MAP)`. Not sure if that is useful, but this also only seems to occur when I am using projection. Also, I believe the failing Parquet files were generated using Apache Iceberg.
> Not sure if that is useful, but this also only seems to occur when I am using projection
Oh, yes, it is useful. When using projection, you define the exact schema you expect from the file you read. And when using Parquet 1.12, the schema will contain `key_value`, which does not match your data.
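For illustration, the two layouts can be written in Parquet's schema notation roughly as below. This is only a sketch: the field name `myMap` comes from the error message above, while the key and value types are assumptions, and the inner group name `map` in the file is taken from the description earlier in this thread rather than from the actual file.

```
// What a projection built with Parquet 1.12 asks for (assumed key/value types)
optional group myMap (MAP) {
  repeated group key_value {
    required binary key (UTF8);
    optional int32 value;
  }
}

// What the legacy file is described as containing
optional group myMap (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (UTF8);
    optional int32 value;
  }
}
```

The projection schema looks up a child called `key_value` inside `myMap`, finds only `map`, and that would produce the `key_value not found` error.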
There are two ways to work around this:
- do not use projection (which is not ideal, I guess)
- provide your own schema definition for a Map; that is, you have to implement your own type class, like this one: https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/Schema.scala#L308 but using your own field names.
Hint: you can use https://github.com/mjakubowski84/parquet4s/blob/master/core/src/main/scala/com/github/mjakubowski84/parquet4s/Schema.scala#L82 to build your own schema def easily.
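As a rough sketch of the second option, a legacy-style map group can be built with parquet-mr's `Types` builder. The object name `LegacyMapSchema`, the field name `myMap`, and the `String -> Int` key and value types are assumptions made for illustration. How this type gets wrapped into a parquet4s schema type class depends on the parquet4s version, so that step is left out here; the helper linked in the hint above is probably the more convenient route.

```scala
import org.apache.parquet.schema.{GroupType, LogicalTypeAnnotation, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

// Hypothetical holder object; not part of parquet4s or parquet-mr.
object LegacyMapSchema {

  // Inner repeated group, named "map" as in the legacy layout
  // instead of "key_value". Key/value types are assumed.
  private val legacyEntries: GroupType =
    Types
      .repeatedGroup()
      .addField(
        Types
          .required(PrimitiveTypeName.BINARY)
          .as(LogicalTypeAnnotation.stringType())
          .named("key")
      )
      .addField(Types.optional(PrimitiveTypeName.INT32).named("value"))
      .named("map")

  // Outer group annotated as MAP; "myMap" is taken from the error message.
  val legacyMap: GroupType =
    Types
      .optionalGroup()
      .as(LogicalTypeAnnotation.mapType())
      .addField(legacyEntries)
      .named("myMap")

  def main(args: Array[String]): Unit =
    // Prints the schema so it can be compared with the file's own schema.
    println(legacyMap)
}
```

A custom schema definition for the map field would then return this type (with the real field name) in place of the default one, so that the projection matches the names stored in the file.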