pacman82/odbc2parquet

Converted type not written to output file for timestamps without timezone

leo-schick opened this issue · 9 comments

When I connect to a SQL Server database and query data of type datetime or datetime2, it gets converted to the parquet data type BIGINT, but I would expect the type TIMESTAMP (as shown here).

It looks like a custom implementation is needed for these two types as well. According to Data Type Support for ODBC Date and Time Improvements, the SQL_TYPE_TIMESTAMP / SQL_TIMESTAMP_STRUCT types should be available for reading the datetime information.
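
For reference, this is roughly the shape of the ODBC timestamp struct involved; the Rust mirror and the `to_epoch_micros` helper below are only an illustrative sketch (the names are made up here and the conversion assumes the chrono crate), not code taken from odbc2parquet:

```rust
use chrono::NaiveDate;

/// repr(C) mirror of the ODBC SQL_TIMESTAMP_STRUCT; field widths follow the
/// ODBC headers, `fraction` holds billionths of a second (nanoseconds).
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
struct SqlTimestampStruct {
    year: i16,     // SQLSMALLINT
    month: u16,    // SQLUSMALLINT
    day: u16,
    hour: u16,
    minute: u16,
    second: u16,
    fraction: u32, // SQLUINTEGER, nanoseconds
}

/// Turn the struct into microseconds since the Unix epoch, i.e. the INT64
/// value a Parquet TIMESTAMP(MICROS) column stores physically.
fn to_epoch_micros(ts: &SqlTimestampStruct) -> Option<i64> {
    let dt = NaiveDate::from_ymd_opt(ts.year as i32, ts.month as u32, ts.day as u32)?
        .and_hms_nano_opt(ts.hour as u32, ts.minute as u32, ts.second as u32, ts.fraction)?;
    Some(dt.and_utc().timestamp_micros())
}
```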

This issue is similar to #249

Hello @leo-schick ,

I am missing something here. Sure, the physical type for timestamps is INT64, but the logical type should be TIMESTAMP. There is even a test to prove it.

See:

parquet_schema_out(out_str).stdout(contains("OPTIONAL INT64 a (TIMESTAMP(MICROS,false));"));

The test uses datetime2 and not datetime. Did you experience the issue with datetime?

Best, Markus

I experienced the issue with both...

Used odbc2parquet version: 0.13.2

The issue is probably not the logical type itself, but that Azure Synapse (which I use to read the parquet files in this sample) does not support the logical type and relies on the legacy converted type instead.

I have now inspected the parquet files with parquet-tools:

MSSQL column with type date

############ Column(OrderDate) ############
name: OrderDate
path: OrderDate
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: ZSTD (space_saved: 18%)

✔️ Works fine with Azure Synapse, using data type date.

MSSQL column with type datetime

############ Column(CreatedDateTime) ############
name: CreatedDateTime
path: CreatedDateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: 30%)

🛑 Does not work with Azure Synapse. I have to read the column with data type bigint; datetime and datetime2 fail.

MSSQL column with type datetime2(7)

############ Column(CreatedDateTime) ############
name: CreatedDateTime
path: CreatedDateTime
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: 15%)

🛑 Does not work with Azure Synapse. I have to read the column with data type bigint; datetime and datetime2 fail.

MSSQL column with type datetimeoffset

############ Column(OrderTimestamp) ############
name: OrderTimestamp
path: OrderTimestamp
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
compression: ZSTD (space_saved: 20%)

✔️ Works fine with Azure Synapse when I use data type datetime2.

It looks to me like Azure Synapse does not interpret the logical_type but the converted_type, which is deprecated. Would it be possible to write the converted_type schema information for the datetime and datetime2 data types as well?
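
Concretely, this is roughly what I would hope the schema field to carry. The snippet is only a sketch against a recent version of the parquet crate (module paths and the LogicalType/TimeUnit spelling differ between crate versions), not how odbc2parquet currently builds its schema:

```rust
use parquet::basic::{ConvertedType, LogicalType, Repetition, TimeUnit, Type as PhysicalType};
use parquet::format::MicroSeconds;
use parquet::schema::types::Type;

// Build a schema field that carries both the modern logical type and the
// legacy converted type, so that readers which only look at the converted
// type (as Azure Synapse seems to) still recognize it as a timestamp.
fn timestamp_field() -> parquet::errors::Result<Type> {
    Type::primitive_type_builder("CreatedDateTime", PhysicalType::INT64)
        .with_repetition(Repetition::OPTIONAL)
        .with_logical_type(Some(LogicalType::Timestamp {
            is_adjusted_to_u_t_c: false,
            unit: TimeUnit::MICROS(MicroSeconds {}),
        }))
        .with_converted_type(ConvertedType::TIMESTAMP_MICROS)
        .build()
}
```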

Note: I found the official documentation, which has a note about the deprecated converted_type:

Despite there is no exact corresponding ConvertedType for local timestamp semantic, in order to support forward compatibility with those libraries, which annotated their local timestamps with legacy TIMESTAMP_MICROS and TIMESTAMP_MILLIS annotation, Parquet writer implementation must annotate local timestamps with legacy annotations too, as shown below.

Backward compatibility:

| ConvertedType | LogicalType |
| --- | --- |
| TIME_MILLIS | TimeType (isAdjustedToUTC = true, unit = MILLIS) |
| TIME_MICROS | TimeType (isAdjustedToUTC = true, unit = MICROS) |

Forward compatibility:

| LogicalType | ConvertedType |
| --- | --- |
| TimeType (isAdjustedToUTC = true, unit = MILLIS) | TIME_MILLIS |
| TimeType (isAdjustedToUTC = true, unit = MICROS) | TIME_MICROS |
| TimeType (isAdjustedToUTC = true, unit = NANOS) | - |
| TimeType (isAdjustedToUTC = false, unit = MILLIS) | TIME_MILLIS |
| TimeType (isAdjustedToUTC = false, unit = MICROS) | TIME_MICROS |
| TimeType (isAdjustedToUTC = false, unit = NANOS) | - |

Yeah, we should do that. I wonder, however, whether this would not be better fixed upstream in the parquet crate. Actually, I assumed it would already do that. Maybe I am mistaken and there is a good reason for it not to, but asking cannot hurt.

Hi @leo-schick ,

thanks again for your thorough bug report. My current understanding is that setting the logical type should imply the converted type, so I opened an upstream ticket. As a workaround I tried to set the converted type manually, but it seems to be ignored when I tested with parquet-tools. I currently cannot think of another quick workaround, so for the moment I am waiting for feedback on the upstream ticket, also to verify that my understanding is correct.

If the situation persists, I could try introducing a flag to instruct odbc2parquet to set the converted type instead of the logical type. Currently my spare time is scarce, so I'd say the best course for now is to move forward with the upstream ticket.

Best Markus

It seems the missing converted type is due to odbc2parquet using INT64 instead of INT32 for millisecond-precision timestamps. However, after switching to INT32 the code now fails when setting the logical type. So, still trying to figure this out, but working on it.

Hi @leo-schick ,

as stated in the conclusion of apache/arrow-rs#3017, the parquet crate (and therefore also odbc2parquet) does emit a converted type. The bug seems to be in readers based on the C++ implementation (like pyarrow). Please open a bug ticket there.
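
For anyone who wants to double-check a file themselves, here is a minimal sketch using the parquet crate that prints the physical, logical, and converted type recorded in the footer ("out.parquet" is a placeholder path):

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("out.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    // The root of the schema is a group; its children are the columns.
    for field in reader.metadata().file_metadata().schema().get_fields() {
        let info = field.get_basic_info();
        println!(
            "{}: physical={:?} logical={:?} converted={:?}",
            info.name(),
            field.get_physical_type(),
            info.logical_type(),
            info.converted_type(),
        );
    }
    Ok(())
}
```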

Cheers, Markus

Also closing this issue, as the converted type is written into the output by odbc2parquet. The fault seems to lie with the readers. Please open a bug there if you need these to work together with odbc2parquet.