hnawaz007/pythondataanalysis

on-prem data lake: io.trino.spi.type.DoubleType error

Closed this issue · 6 comments

Hey, following along with your YT guide for the on-prem data lake.

After creating the table minio.sales.sales_parquet, I'm getting a type error:
SQL Error [65536]: Query failed (#20230901_163330_00034_wbhe5): io.trino.spi.type.DoubleType

Not sure if it's related, but I am running this on macOS 11.7.8 (20G1351).

Here's the full stack trace:

java.lang.UnsupportedOperationException: io.trino.spi.type.DoubleType
	at io.trino.spi.type.AbstractType.writeSlice(AbstractType.java:115)
	at io.trino.parquet.reader.BinaryColumnReader.readValue(BinaryColumnReader.java:55)
	at io.trino.parquet.reader.PrimitiveColumnReader.lambda$readValues$2(PrimitiveColumnReader.java:183)
	at io.trino.parquet.reader.PrimitiveColumnReader.processValues(PrimitiveColumnReader.java:203)
	at io.trino.parquet.reader.PrimitiveColumnReader.readValues(PrimitiveColumnReader.java:182)
	at io.trino.parquet.reader.PrimitiveColumnReader.readPrimitive(PrimitiveColumnReader.java:170)
	at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:262)
	at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:314)
	at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:297)
	at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:164)
	at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:381)
	at io.trino.spi.block.LazyBlock$LazyData.getFullyLoadedBlock(LazyBlock.java:360)
	at io.trino.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:276)
	at io.trino.spi.Page.getLoadedPage(Page.java:279)
	at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:304)
	at io.trino.operator.Driver.processInternal(Driver.java:379)
	at io.trino.operator.Driver.lambda$processFor$8(Driver.java:283)
	at io.trino.operator.Driver.tryWithLock(Driver.java:675)
	at io.trino.operator.Driver.processFor(Driver.java:276)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1076)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
	at io.trino.$gen.Trino_351____20230901_154410_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

Check the data stored in MinIO; the file may contain more columns than are defined in the Hive metastore table. When Trino queries the data, it may then find a different data type, e.g. varchar instead of DoubleType.

This can also happen if a single record has a type different from what is defined in the metastore. Try deleting the data and importing it again without the extra, unneeded column.
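As a quick check, here is a minimal Python sketch (assuming pyarrow is installed and the file has been downloaded locally from the MinIO bucket; the path is illustrative) that prints the file's actual schema so you can compare it against the table DDL:

```python
import pyarrow.parquet as pq

# Illustrative local path; download the file from the MinIO bucket first
schema = pq.read_schema("sales_summary.parquet")

# Print every column with its physical type so it can be compared
# against the Hive metastore table definition
for field in schema:
    print(field.name, field.type)
```

Any column in this listing that is missing from the DDL, or whose type does not match it, is a candidate for the error above.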

I have uploaded an updated file with only the columns specified in the table DDL. Try it with the updated file:
https://github.com/hnawaz007/pythondataanalysis/blob/main/data-lake/data/sales_summary_updated.parquet

I see. I will try your file in a second. It's my first time handling Parquet files; I was able to open it with an in-memory DuckDB connection in DBeaver, but I'm unsure how to export it back to Parquet.
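Something like this DuckDB COPY is what I was after, I think (a sketch, assuming the DuckDB Python client; file names and the column list are illustrative):

```python
import duckdb

con = duckdb.connect()  # in-memory database

# Re-export only the columns defined in the table DDL back to Parquet
# (file names and column list here are illustrative)
con.execute("""
    COPY (
        SELECT order_id, quantity, total
        FROM read_parquet('sales_summary.parquet')
    ) TO 'sales_summary_fixed.parquet' (FORMAT PARQUET)
""")
```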

How did you edit the Parquet file?

There are a few tools that allow you to view Parquet files, such as Tad and DBeaver. If you would like to edit them, use a programming language, e.g. Java or Python, to read and modify the file.
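For example, a minimal pandas sketch (assuming a Parquet engine such as pyarrow is installed; file and column names are illustrative) that reads the file, drops the extra column, and writes it back out:

```python
import pandas as pd

# Read the file, drop the column that is not in the table DDL,
# and write the result back to Parquet (names are illustrative)
df = pd.read_parquet("sales_summary.parquet")
df = df.drop(columns=["extra_column"])
df.to_parquet("sales_summary_updated.parquet", index=False)
```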

OK, yes, that works. I ended up using PySpark, and with the subset of columns it works.
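For anyone else following along, roughly what I did (a sketch; file and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fix-parquet").getOrCreate()

# Keep only the columns defined in the table DDL (names are illustrative)
df = spark.read.parquet("sales_summary.parquet")
df.select("order_id", "quantity", "total") \
    .write.mode("overwrite") \
    .parquet("sales_summary_updated")

spark.stop()
```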