exasol/hadoop-etl-udfs

Date/Time types are not supported

ElisaMariaS opened this issue · 7 comments

I'm trying to use this tool to read data from delta files I'm storing in HDFS, which contain several date/time columns (DATE and TIMESTAMP, namely), but I get the following errors when trying to use them:

com.exasol.ExaDataTypeException: emit column 'MY_COL' is of unsupported type org.apache.hadoop.hive.common.type.Timestamp

And

com.exasol.ExaDataTypeException: emit column 'MY_COL' is of unsupported type org.apache.hadoop.hive.common.type.Date

Both point to HdfsSerDeImportService.importFile(), which calls ExaIteratorImpl.emit(). Looking into that code, I found that there are mappings for types such as HiveDecimal and HiveVarchar to their standard Java counterparts, so I guess we are missing the corresponding mappings for the date/time types.
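To illustrate what I mean, here is a simplified sketch of the kind of mapping I found (not the actual ExaIteratorImpl code; convertForEmit is a hypothetical name):

```java
import org.apache.hadoop.hive.common.type.HiveDecimal;
import org.apache.hadoop.hive.common.type.HiveVarchar;

// Hypothetical sketch of the emit-side type mapping, not the actual ExaIteratorImpl code.
static Object convertForEmit(Object value) {
    if (value instanceof HiveDecimal) {
        return ((HiveDecimal) value).bigDecimalValue(); // Hive decimal -> java.math.BigDecimal
    }
    if (value instanceof HiveVarchar) {
        return ((HiveVarchar) value).getValue();        // Hive varchar -> java.lang.String
    }
    // There is no branch for org.apache.hadoop.hive.common.type.Date/Timestamp,
    // so those values fall through unconverted and emit() rejects them.
    return value;
}
```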

Hello @ElisaMariaS,

Thanks for reporting this issue!

We use the Hive object inspector utility function to convert the Hive values into Java objects.

But maybe the java.sql.Timestamp and java.sql.Date conversions are skipped. I am going to look into it.
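For context, the conversion is roughly the following (a minimal sketch with placeholder names, not our actual code):

```java
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspector;

// Minimal sketch; `inspector` and `rawValue` are placeholder names.
static Object toJavaObject(PrimitiveObjectInspector inspector, Object rawValue) {
    // For a TIMESTAMP column this is expected to return java.sql.Timestamp, but if
    // the inspector returns org.apache.hadoop.hive.common.type.Timestamp instead,
    // an extra conversion would be needed before emitting the value.
    return inspector.getPrimitiveJavaObject(rawValue);
}
```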

It might have to do with the Hive version. I think those types were not introduced until 3.x.x.

Hello @ElisaMariaS, @castorm

Thanks for your feedback!

Indeed, in Hive 3.1.0 the JavaDateObjectInspector returns the Hive Date and Timestamp types instead of the Java SQL types.

We have added a fix for this issue in the develop branch.
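Conceptually, the fix maps the new Hive types back to their Java SQL counterparts. A minimal sketch (not the actual patch; the helper names are made up, and it assumes the Hive 3.x types expose toEpochMilli()/getNanos() accessors):

```java
import java.sql.Date;
import java.sql.Timestamp;

// Hypothetical helpers sketching the conversion; not the actual develop-branch patch.
final class HiveDateTimeConversions {

    static Timestamp toSqlTimestamp(org.apache.hadoop.hive.common.type.Timestamp hiveTs) {
        Timestamp ts = new Timestamp(hiveTs.toEpochMilli()); // assumes toEpochMilli() exists
        ts.setNanos(hiveTs.getNanos());                      // keep sub-millisecond precision
        return ts;
    }

    static Date toSqlDate(org.apache.hadoop.hive.common.type.Date hiveDate) {
        return new Date(hiveDate.toEpochMilli());            // assumes toEpochMilli() exists
    }
}
```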

Unfortunately, we cannot merge it into master at the moment, since we still have deployments on Cloudera Hadoop distributions. The latest CDH release, 6.3.3, still uses Hive 2.2.1 and Hadoop 3.0.0.

Please try out the latest commit on the develop branch and let us know if there are any issues.

This is probably related. We get the following exception when trying to transfer a Hive table (written with Spark) that contains DATE columns (it works with TIMESTAMP columns, though):
VM error: F-UDF-CL-LIB-1127: F-UDF-CL-SL-JAVA-1002: F-UDF-CL-SL-JAVA-1013: com.exasol.ExaUDFException: F-UDF-CL-SL-JAVA-1080: Exception during run java.io.IOException: hdfs://.../my_table/part-00043-b69d1883-55b4-4cbb-84e1-8fe59013ca23-c000.snappy.parquet not a SequenceFile

Our Hive version is 1.2.x and our Spark version is 2.4.

Hello @salim7,

Thanks for the feedback!

Do you have more of the stack trace for the exception?

From what you have shared, it may also be an issue with the file itself. The message hdfs://.../my_table/part-00043-b69d1883-55b4-4cbb-84e1-8fe59013ca23-c000.snappy.parquet not a SequenceFile indicates that the file may be corrupted or not fully written yet.

@morazow, no, the file is not corrupted: I can read it with spark.read.parquet without any failures. Furthermore, we can reproduce this reliably. As stated, as long as there is a DATE column inside the parquet file, we get this exception.
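For reference, the check was essentially this (a minimal sketch; the path is a placeholder for the elided one above):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class VerifyParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("verify-parquet").getOrCreate();
        // Placeholder path; the real part-file path is elided in the report above.
        Dataset<Row> df = spark.read().parquet("hdfs:///path/to/my_table");
        df.printSchema();    // the DATE column shows up in the schema
        df.show(5, false);   // this would fail if the file were actually corrupted
        spark.stop();
    }
}
```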

Closing this ticket because the project is discontinued. Please take a look at these possible alternatives:

  1. cloud-storage-extension
  2. Hive Virtual Schema
  3. Impala Virtual Schema