Top level array of strings field not working on Athena
billjohnston opened this issue · 1 comments
Having trouble getting a top level array field to work with Athena
Athena create table query:
CREATE EXTERNAL TABLE IF NOT EXISTS db.table (
id STRING,
...etc
tags ARRAY<STRING>
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket'
parquetjs schema:
new ParquetSchema({
id: { type: 'UTF8' },
...etc
tags: { type: 'UTF8', repeated: true },
})
I'm able to upload and create the parquet file with this, and queries that don't include the tags
field are working fine:
SELECT id from db.table
But if I run any query that includes the tags
field I get this error:
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<path_to_the_parquet_file> (offset=0, length=8612):
org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO
I noticed there is a LIST
field type that is supposed to work with Athena, but I'm not sure how I'd specify a top level list of strings
Did you ever figure this out? I spent a couple hours today and finally came up with this schema:
new parquetJs.ParquetSchema({
names: {
type: 'LIST',
fields: {
list: {
repeated: true,
fields: {
element: {
type: 'UTF8'
}
}
}
}
}
})
I also had to prepare the data going into the names field by wrapping values with list
and element
, like so:
{
names: {
list: [
{ element: 'cathy' },
{ element: 'Zooplar the Magnificent' },
{ element: 'tim' }
]
}
Using this pattern, AWS Glue sees the names field as array<string>
, and they show up in Athena queries.
It feels like this was a LOT harder than it should be, so if anyone knows of an easier way, please do let me know. 😅