pacman82/odbc2parquet

Support MSSQL data type TIME

leo-schick opened this issue · 10 comments

Currently the data type time from MSSQL is exported as BYTE_ARRAY, UTF8, String:

Column description from parquet-tools:

############ Column(OrderTime) ############
name: OrderTime
path: OrderTime
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: 62%)

I would have expected a parquet TIME type in logical_type/converted_type.

Support for time is still missing. I had a branch two years ago (master...time), yet never merged it, because I could not find a column type which would identify itself as 92 (SQL_TYPE_TIME). The time data type of an MSSQL table seems to be another custom type and identifies as -154. So this seems to require custom code for MSSQL. This time, though, I cannot fathom why the default SQL type would not do.

Valid feature request; it might take a while until I get to it, though.

You are right, MSSQL uses a custom type as described here: https://learn.microsoft.com/en-us/sql/relational-databases/native-client-odbc-date-time/data-type-support-for-odbc-date-and-time-improvements?view=sql-server-ver16

The type is called SQL_SS_TIME2 and has the following structure:

typedef struct tagSS_TIME2_STRUCT {  
   SQLUSMALLINT hour;  
   SQLUSMALLINT minute;  
   SQLUSMALLINT second;  
   SQLUINTEGER fraction;  
} SQL_SS_TIME2_STRUCT;
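
For illustration, here is a minimal Rust sketch (hypothetical, not the actual odbc2parquet code) of how a fetched SQL_SS_TIME2_STRUCT could be turned into the nanoseconds-since-midnight value that a Parquet Time(NANOS) column stores:

// Hypothetical sketch: mirrors SQL_SS_TIME2_STRUCT and converts it to the
// nanoseconds-since-midnight representation used by Parquet Time(NANOS).
#[repr(C)]
struct SqlSsTime2 {
    hour: u16,     // SQLUSMALLINT
    minute: u16,   // SQLUSMALLINT
    second: u16,   // SQLUSMALLINT
    fraction: u32, // SQLUINTEGER, fractional seconds in nanoseconds
}

fn time2_to_nanos_since_midnight(t: &SqlSsTime2) -> i64 {
    let seconds = t.hour as i64 * 3600 + t.minute as i64 * 60 + t.second as i64;
    seconds * 1_000_000_000 + t.fraction as i64
}

fn main() {
    // 13:37:42.1234567 stored in a time(7) column (fraction given in nanoseconds)
    let t = SqlSsTime2 { hour: 13, minute: 37, second: 42, fraction: 123_456_700 };
    println!("{}", time2_to_nanos_since_midnight(&t)); // prints 49062123456700
}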

odbc2parquet 0.14.0 is released, which maps TIME to Time Nano. Mapping it to Micro or Milli depending on precision seems to trigger upstream not-implemented errors.
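
For reference, the precision-based mapping in question could look roughly like this (a hypothetical sketch using a stand-in enum, not the parquet crate's actual API); MSSQL time(p) stores p fractional digits, so time(0..3) fits into milliseconds, time(4..6) into microseconds, and time(7) needs nanoseconds:

// Hypothetical stand-in for the Parquet time units; not the parquet crate's API.
enum ParquetTimeUnit {
    Millis,
    Micros,
    Nanos,
}

// Pick the smallest unit that can hold an MSSQL time(p) value without loss.
fn unit_for_precision(p: u8) -> ParquetTimeUnit {
    match p {
        0..=3 => ParquetTimeUnit::Millis,
        4..=6 => ParquetTimeUnit::Micros,
        _ => ParquetTimeUnit::Nanos, // time(7), the default, has 100 ns steps
    }
}

fn main() {
    assert!(matches!(unit_for_precision(7), ParquetTimeUnit::Nanos));
    assert!(matches!(unit_for_precision(3), ParquetTimeUnit::Millis));
}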

@leo-schick Does odbc2parquet 0.14.0 resolve your issue?

Closing this for now

Hi @pacman82,

sorry for the late response. I was quite busy with some other tasks.

Unfortunately, this does not seem to work as it should. I am not 100% sure how it should be solved, though...

Here are my validation results:

I have a table with a SQL time column:
[screenshot]

After upgrading, parquet-tools now shows that the logical_type is Time:

############ Column(OrderTime) ############
name: OrderTime
path: OrderTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Time(isAdjustedToUTC=false, timeUnit=nanoseconds)
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 11%)

Reading with Apache Spark

When I use Apache Spark 3.3.0 to read it as a SQL TIMESTAMP type (the data type TIME is not supported in Apache Spark SQL), I get this for the same record:

[screenshot]

I just wonder how this gets messed up.
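
As a debugging aid, here is a small sketch (a hypothetical helper, not part of odbc2parquet or Spark) that formats such a nanoseconds-since-midnight INT64 back into a clock time, to check whether the value stored in the file is itself correct:

// Hypothetical debugging helper: format a Parquet Time(NANOS) value
// (nanoseconds since midnight, stored as INT64) as HH:MM:SS.fffffffff.
fn format_nanos_since_midnight(nanos: i64) -> String {
    let secs = nanos / 1_000_000_000;
    let frac = nanos % 1_000_000_000;
    format!("{:02}:{:02}:{:02}.{:09}", secs / 3600, (secs / 60) % 60, secs % 60, frac)
}

fn main() {
    // Example value: 13:37:42.1234567 as nanoseconds since midnight
    println!("{}", format_nanos_since_midnight(49_062_123_456_700)); // 13:37:42.123456700
}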

Reading with Microsoft Synapse

When I use Microsoft Synapse, I get the following error message:

Column 'OrderTime' of type 'TIME' is not compatible with external data type 'Parquet physical type: INT64', please try with 'BIGINT'. File/External table name: '<table_name>'.

Probably because the converted_type (legacy) is missing.

I think we should find another way to solve this. I mean, Apache Spark is quite popular, and it should at least work correctly there.

p.s. I just noticed that the SQL data type time is by default time(7), which has 100 ns precision. Using Time(.., timeUnit=nanoseconds) is then correct, and mapping converted_type (legacy) to None is correct as well, see here.

Then the remaining question is why we read it like this with Apache Spark...

Hi @leo-schick , thanks for the response.

Then the remaining question is why we read it like this with Apache Spark

Do you mean 'why' or 'how'?

I wonder if odbc2parquet should offer a flag to choose microseconds precision, or even whether it should do so by default. What do you think, @leo-schick?

Do you mean 'why' or 'how'?
I meant how. Or rather: how do we best get “time” into the parquet file so that we can read it correctly in Apache Spark?

I am not yet sure about the flag for microseconds precision. I would propose a flag which always tries to convert to a converted_type if possible. I think I can build a workaround for myself as long as it works in Apache Spark. IMO, converting data should not be part of the odbc2parquet tool - maybe as an option if you would like to implement it, but I do not think I will use it.
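
To illustrate what I mean by converting to a converted_type where possible: truncating the value to microseconds would allow writing Time(MICROS), which also has the legacy TIME_MICROS converted_type that older readers understand. A rough sketch (hypothetical, not existing odbc2parquet code), assuming nanoseconds since midnight as input:

// Hypothetical: truncate a nanoseconds-since-midnight value to microseconds so
// the column could be annotated as Time(MICROS) / legacy TIME_MICROS.
// This drops the 100 ns resolution of MSSQL time(7) values.
fn nanos_to_micros_since_midnight(nanos: i64) -> i64 {
    nanos / 1_000
}

fn main() {
    let nanos = 49_062_123_456_700_i64; // 13:37:42.1234567
    println!("{}", nanos_to_micros_since_midnight(nanos)); // prints 49062123456
}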