Support MSSQL data type TIME
leo-schick opened this issue · 10 comments
Currently, the data type time from MSSQL is exported as BYTE_ARRAY, UTF8, String:
Column description from parquet-tools:
############ Column(OrderTime) ############
name: OrderTime
path: OrderTime
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: 62%)
I would have expected a Parquet TIME type in logical_type/converted_type.
Support for time is still missing. I had a branch two years ago (master...time), yet never merged it, because I could not find a column type which would identify itself as 92 (SQL_TYPE_TIME). The time data type of an MSSQL table seems to be another custom type and identifies as -154. So this seems to require custom code for MSSQL. Even now, I cannot fathom why the default SQL type would not do.
Valid feature request; it might take a while until I get to it, though.
You are right, MSSQL uses a custom type as described here: https://learn.microsoft.com/en-us/sql/relational-databases/native-client-odbc-date-time/data-type-support-for-odbc-date-and-time-improvements?view=sql-server-ver16
The type is called SQL_SS_TIME2
and has the following structure:
typedef struct tagSS_TIME2_STRUCT {
    SQLUSMALLINT hour;
    SQLUSMALLINT minute;
    SQLUSMALLINT second;
    SQLUINTEGER  fraction;
} SQL_SS_TIME2_STRUCT;
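For illustration, these four fields can be packed into the nanoseconds-since-midnight integer that a Parquet Time(unit=NANOS) column stores. A minimal Python sketch (the function name is mine, not part of odbc2parquet; per the ODBC docs the fraction field holds nanoseconds):

```python
def ss_time2_to_nanos(hour: int, minute: int, second: int, fraction: int) -> int:
    """Combine SQL_SS_TIME2_STRUCT fields into nanoseconds since midnight,
    which is what a Parquet Time(unit=NANOS) INT64 column stores."""
    return ((hour * 60 + minute) * 60 + second) * 1_000_000_000 + fraction

# Example: 13:37:05.1234567 — time(7) has 100 ns resolution,
# so the fraction is always a multiple of 100.
nanos = ss_time2_to_nanos(13, 37, 5, 123_456_700)
```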
odbc2parquet 0.14.0 is released, which maps TIME to Time Nano. Mapping it to Micro or Milli depending on precision seems to trigger upstream "not implemented" errors.
@leo-schick Does odbc2parquet 0.14.0
resolve your issue?
Closing this for now
Hi @pacman82 ,
sorry for the late response. Was quite busy with some other tasks.
Unfortunately, this does not seem to work as it should. I am not 100% sure how this should be solved, though...
Here my validation results:
I have a table with a SQL time column.
After upgrading, parquet-tools now shows that the logical_type is Time:
############ Column(OrderTime) ############
name: OrderTime
path: OrderTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Time(isAdjustedToUTC=false, timeUnit=nanoseconds)
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 11%)
Reading with Apache Spark
When I use Apache Spark 3.3.0 to read it as a SQL TIMESTAMP type (the TIME data type is not supported in Apache Spark SQL), I get this for the same record:
I just wonder how this gets messed up.
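For reference, the raw INT64 is supposed to encode nanoseconds since midnight, not microseconds since the epoch. A plain-Python sketch (not Spark code; the function name is illustrative) of decoding it correctly:

```python
from datetime import time

def nanos_to_time(nanos_since_midnight: int) -> time:
    """Decode a Parquet Time(NANOS) value (nanoseconds since midnight).
    Python's datetime.time only keeps microseconds, so sub-microsecond
    digits are truncated here."""
    seconds, nanos = divmod(nanos_since_midnight, 1_000_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, nanos // 1_000)

# A reader that instead treats the INT64 as epoch-based microseconds
# would land somewhere in 1971 rather than at a time of day.
t = nanos_to_time(49_025_123_456_700)
```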
Reading with Microsoft Synapse
When I use Microsoft Synapse, I get the following error message:
Column 'OrderTime' of type 'TIME' is not compatible with external data type 'Parquet physical type: INT64', please try with 'BIGINT'. File/External table name: '<table_name>'.
Probably because the converted_type (legacy) is missing.
I think we should find another way to solve this. Apache Spark is widely used, and this should at least work correctly there.
P.S. I just noted that the SQL data type time is by default time(7), which has 100 ns precision. Using Time(..., timeUnit=nanoseconds) is then correct, and mapping converted_type (legacy) to NONE is correct as well, see here.
The remaining question is then why we read it with Apache Spark....
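A small sketch of why converted_type must be NONE here, assuming the Parquet spec: the legacy converted types only cover TIME_MILLIS and TIME_MICROS, so a nanosecond-precision column can carry only the newer Time logical type (function name is illustrative, not odbc2parquet code):

```python
def parquet_time_annotation(sql_precision: int):
    """Map an MSSQL time(p) precision (0-7 fractional digits) to a
    (logical_type time unit, legacy converted_type) pair. The Parquet
    spec defines no legacy converted type for nanoseconds."""
    if sql_precision <= 3:
        return ("MILLIS", "TIME_MILLIS")
    if sql_precision <= 6:
        return ("MICROS", "TIME_MICROS")
    return ("NANOS", None)

# time(7) means 100 ns steps -> nanosecond unit, no legacy annotation.
unit, legacy = parquet_time_annotation(7)
```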
Hi @leo-schick , thanks for the response.
The remaining question is then why we read it with Apache Spark
Do you mean 'why' or 'how'?
I wonder if odbc2parquet
should offer a flag to choose microseconds precision, or even if it should do so by default. What do you think @leo-schick ?
Do you mean 'why' or 'how'?
I meant how. Or better: how do we best get "time" into the Parquet file so that we can read it correctly in Apache Spark?
I am not yet sure about the flag for microseconds precision. I would propose a flag which always tries to convert to a converted_type if possible. I think I can build a workaround on my side, as long as it works in Apache Spark. IMO, converting data should not be part of the odbc2parquet tool. Maybe as an option, if you would like to implement it, but I do not think I would use it.
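To illustrate what such a flag could do (a hypothetical sketch, not existing odbc2parquet behavior): truncate the nanosecond values to microseconds so the column can carry the legacy TIME_MICROS annotation that older readers such as Synapse understand, at the cost of the trailing 100 ns digit of time(7):

```python
def downcast_nanos_to_micros(nanos_since_midnight: int) -> int:
    """Truncate a nanoseconds-since-midnight value to microseconds so it
    can be written as Time(MICROS) with converted_type TIME_MICROS.
    The sub-microsecond remainder is discarded."""
    return nanos_since_midnight // 1_000

# 13:37:05.1234567 loses its trailing 100 ns digit:
micros = downcast_nanos_to_micros(49_025_123_456_700)
```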