Converted type not written to output file for timestamps without timezone
leo-schick opened this issue · 9 comments
When I connect to a SQL Server database and query data with data type datetime or datetime2, it gets converted to parquet data type BIGINT. But I would expect data type TIMESTAMP (as shown here).
This looks like a custom implementation needs to be done for these two types as well. According to Data Type Support for ODBC Date and Time Improvements, the SQL_TYPE_TIMESTAMP/SQL_TIMESTAMP struct types should be available to read the datetime information.
This issue is similar to #249
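To illustrate what I mean, here is a rough sketch (not odbc2parquet code; the struct simply mirrors the ODBC SQL_TIMESTAMP_STRUCT, and the helper name is made up) of turning that struct into the milliseconds-since-epoch value a TIMESTAMP column would store:

```rust
use chrono::{NaiveDate, NaiveDateTime};

/// Mirror of the ODBC SQL_TIMESTAMP_STRUCT as it is filled for
/// datetime / datetime2 columns; `fraction` is in nanoseconds.
struct OdbcTimestamp {
    year: i16,
    month: u16,
    day: u16,
    hour: u16,
    minute: u16,
    second: u16,
    fraction: u32,
}

/// Convert the struct into milliseconds since the Unix epoch, i.e. the
/// INT64 value a writer would store for a millisecond TIMESTAMP column.
fn to_epoch_millis(ts: &OdbcTimestamp) -> Option<i64> {
    let datetime: NaiveDateTime =
        NaiveDate::from_ymd_opt(ts.year as i32, ts.month as u32, ts.day as u32)?
            .and_hms_nano_opt(ts.hour as u32, ts.minute as u32, ts.second as u32, ts.fraction)?;
    Some(datetime.and_utc().timestamp_millis())
}
```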
Hello @leo-schick,
I am missing something here. Sure, the physical type for timestamps is INT64, but its logical type should be TIMESTAMP. There is even a test to prove it. See line 686 in aee2ca5.
The test uses datetime2 and not datetime. Did you experience the issue with datetime?
Best, Markus
I experienced the issue with both...
Used odbc2parquet version: 0.13.2
Probably the issue is not the logical type, but that Azure Synapse (which I use to read the parquet files in this sample) does not support the logical type and instead uses the converted type (legacy).
I now inspected the parquet files with parquet-tools:
MSSQL column with type date
############ Column(OrderDate) ############
name: OrderDate
path: OrderDate
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: ZSTD (space_saved: 18%)
✔️ Works fine with Azure Synapse, using data type date.
MSSQL column with type datetime
############ Column(CreatedDateTime) ############
name: CreatedDateTime
path: CreatedDateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: 30%)
🛑 Does not work with Azure Synapse. I must read the column with data type bigint; datetime and datetime2 fail.
MSSQL column with type datetime2(7)
############ Column(CreatedDateTime) ############
name: CreatedDateTime
path: CreatedDateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: 15%)
🛑 Does not work with Azure Synapse. I must read the column with data type bigint; datetime and datetime2 fail.
MSSQL column with type datetimeoffset
############ Column(OrderTimestamp) ############
name: OrderTimestamp
path: OrderTimestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
compression: ZSTD (space_saved: 20%)
✔️ Works fine with Azure Synapse when I use data type datetime2.
It looks to me like Azure Synapse does not interpret the logical_type but the converted_type, which is probably deprecated. Would it be possible to add the converted_type schema information to the datetime and datetime2 data types as well?
Note: I found the official documentation, which has a note about the deprecated converted_type:
Despite there is no exact corresponding ConvertedType for local timestamp semantic, in order to support forward compatibility with those libraries, which annotated their local timestamps with legacy TIMESTAMP_MICROS and TIMESTAMP_MILLIS annotation, Parquet writer implementation must annotate local timestamps with legacy annotations too, as shown below.

Backward compatibility:

    ConvertedType    LogicalType
    TIME_MILLIS      TimeType (isAdjustedToUTC = true, unit = MILLIS)
    TIME_MICROS      TimeType (isAdjustedToUTC = true, unit = MICROS)

Forward compatibility:

    LogicalType                                          ConvertedType
    TimeType    isAdjustedToUTC = true     unit = MILLIS    TIME_MILLIS
                                           unit = MICROS    TIME_MICROS
                                           unit = NANOS     -
                isAdjustedToUTC = false    unit = MILLIS    TIME_MILLIS
                                           unit = MICROS    TIME_MICROS
                                           unit = NANOS     -
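The table above is the one for the TIME types; if I read the spec right, the Timestamp section has the analogous rule. Sketched in Rust against the parquet crate's ConvertedType enum (my own illustration, not existing code), it boils down to:

```rust
use parquet::basic::ConvertedType;

/// Precision of a timestamp column.
enum Unit {
    Millis,
    Micros,
    Nanos,
}

/// Forward-compatibility rule for timestamps as I understand the spec:
/// the legacy annotation depends only on the unit, and local
/// (isAdjustedToUTC = false) timestamps get the same legacy annotation
/// as UTC ones. Nanosecond precision has no legacy equivalent.
fn legacy_timestamp_annotation(_is_adjusted_to_utc: bool, unit: Unit) -> Option<ConvertedType> {
    match unit {
        Unit::Millis => Some(ConvertedType::TIMESTAMP_MILLIS),
        Unit::Micros => Some(ConvertedType::TIMESTAMP_MICROS),
        Unit::Nanos => None,
    }
}
```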
Yeah, we should do that. I wonder, however, if this is not better fixed upstream in the parquet crate. Actually, I assumed it would already do that. Maybe I am mistaken and there is a good reason for it not to, but asking cannot hurt.
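For context, the manual override I have in mind would look roughly like this, a sketch against what I believe is the parquet crate's schema builder (untested, and import paths / field names may differ between crate versions):

```rust
use parquet::basic::{ConvertedType, LogicalType, Repetition, Type as PhysicalType};
use parquet::errors::Result;
use parquet::format::TimeUnit;
use parquet::schema::types::Type;

/// Declare a nullable millisecond-precision timestamp column annotated with
/// both the modern logical type and the legacy converted type, so readers
/// that only look at the converted type still see a timestamp.
fn timestamp_millis_column(name: &str) -> Result<Type> {
    Type::primitive_type_builder(name, PhysicalType::INT64)
        .with_repetition(Repetition::OPTIONAL)
        .with_logical_type(Some(LogicalType::Timestamp {
            is_adjusted_to_u_t_c: false,
            unit: TimeUnit::MILLIS(Default::default()),
        }))
        .with_converted_type(ConvertedType::TIMESTAMP_MILLIS)
        .build()
}
```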
Hi @leo-schick,
thanks again for your thorough bug report. My current understanding is that by setting the logical type, the converted type should be implied, so I opened an upstream ticket. As a workaround I tried to set the converted type manually, but it seems to be ignored when I tested with parquet-tools. I currently cannot think of another quick workaround, so for the moment I am waiting for feedback on the upstream ticket, also to verify that my understanding is correct.
If the situation persists, I could try introducing a flag to instruct odbc2parquet to set the converted type instead of the logical type. My spare time is currently scarce, so the best course for now, I'd say, is to move forward with the upstream ticket.
Best, Markus
It seems the missing converted type is due to odbc2parquet using INT64 instead of INT32 for millisecond-precision timestamps. However, after switching to INT32, the code now fails when setting the logical type. So, still trying to figure this thing out, but working on it.
Hi @leo-schick,
as stated in the conclusion of apache/arrow-rs#3017, the parquet crate (and therefore also odbc2parquet) does emit a converted type. The bug seems to be in the readers based on the C++ implementation (like pyarrow). Please open a bug ticket there.
Cheers, Markus
Also closing this issue, as the converted type is written into the output by odbc2parquet. The fault seems to lie with the readers. Please open a bug there if you need these to work together with odbc2parquet.