delta-io/connectors

Create a Delta table without specifying the table schema in Hive

cometta opened this issue · 9 comments

Is it possible to use Hive's 'create external table' statement without specifying each column's type, and have the connector infer the table schema automatically or read it from the Parquet metadata?

@zsxwing I see that you have assigned this ticket to yourself. I am just following up on its status.

To give a response myself:

See the delta.io:connectors > Hive > create a table wiki, which states:

the table schema in the CREATE TABLE statement must be consistent with the underlying Delta metadata. Otherwise, the connector will throw an error to tell you about the inconsistency.
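To illustrate the kind of consistency check the wiki describes, here is a minimal sketch in Python. It is not the connector's actual code: `check_schema`, `normalize`, and `SchemaMismatchError` are hypothetical names, columns are modelled as plain `(name, type)` pairs, and the case-insensitive, order-sensitive comparison is an assumption made for the example.

```python
from typing import List, Tuple

# (column name, Hive type string), e.g. ("id", "int")
Column = Tuple[str, str]


class SchemaMismatchError(Exception):
    """Hypothetical error type used only in this sketch."""


def normalize(columns: List[Column]) -> List[Column]:
    # Assumption: names and type strings are compared case-insensitively.
    return [(name.strip().lower(), type_.strip().lower()) for name, type_ in columns]


def check_schema(declared: List[Column], delta: List[Column]) -> None:
    """Raise if the declared columns differ from the Delta metadata columns."""
    # Assumption: column order matters, so the lists are compared as-is.
    if normalize(declared) != normalize(delta):
        raise SchemaMismatchError(
            f"Declared schema {declared} does not match Delta schema {delta}"
        )
```

With this sketch, `check_schema([("ID", "INT")], [("id", "int")])` passes silently, while declaring `("id", "int")` against a Delta column `("id", "string")` raises `SchemaMismatchError`.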

I apologize, I do not know whether this feature is currently supported by Delta itself. However, the underlying Parquet library, parquet4s, from version 2.+ supports reading a Parquet file without a specified projection. I am not certain how extensive the support is, but a basic set of data types is supported.
I am currently working on unit test / dependency version stabilization for my pull request. Once this is settled I could look into adding this feature.
It would require Scala 2.12+.

Hi @MironAtHome, may I ask what your use case is here? That is, what problem are you trying to solve, and what API or functionality is the delta-standalone / delta-hive connector currently missing?

@MironAtHome, this is a Hive connector issue. It's not related to Delta Standalone.

I looked at the Hive code again and found that we may be able to support this by introducing our own Hive SerDe. A prototype is at https://github.com/delta-io/connectors/compare/master...zsxwing:hive-no-schema?expand=1 but I probably don't have time to work on this right now. If anyone has free time, feel free to pick up the work.

Also cc @YannByron in case you are interested.

@zsxwing if we want to support creating a table without a schema, I think it is not necessary to introduce our own SerDe. A minor change to DeltaStorageHandler.preCreateTable may be enough.

Update: metadata.Table in Hive forces users to provide at least one column, and I can't find any configuration option to skip this check.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java L263.

If that is the case, one possible approach is to declare a dummy column such as dummy_col array<string>, and have DeltaStorageHandler.preCreateTable replace it with the real schema. The CREATE TABLE statement would look like:

CREATE TABLE table_name (dummy_col array<string>) STORED BY 'io.delta.hive.DeltaStorageHandler' LOCATION "/path/to/delta";
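The replacement step in this proposal could look roughly like the Python sketch below. It only models the decision logic, not Hive's or the connector's real API: `resolve_columns` is a hypothetical helper, columns are plain `(name, type)` pairs, and the rule "treat exactly one `dummy_col array<string>` column as the placeholder" is an assumption taken from the statement above.

```python
from typing import List, Tuple

# (column name, Hive type string)
Column = Tuple[str, str]

# Placeholder column from the suggestion above.
DUMMY_COLUMNS: List[Column] = [("dummy_col", "array<string>")]


def resolve_columns(declared: List[Column], delta: List[Column]) -> List[Column]:
    """Return the columns to register for the table (hypothetical helper)."""
    if declared == DUMMY_COLUMNS:
        # Only the placeholder was declared: use the schema from Delta metadata.
        return list(delta)
    # Otherwise keep the declared columns; validating them is a separate step.
    return list(declared)
```

For example, given a Delta schema of `[("id", "int"), ("name", "string")]`, declaring only the dummy column returns the Delta schema, while an explicit column list is returned unchanged. One design question this leaves open is what happens if a real table legitimately has a single column named `dummy_col`.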

As per our discussion in #323, we will continue to investigate the Hive SerDe solution.