Support for data type timezone conversion to UTC
leo-schick opened this issue · 11 comments
In TSQL/SQL Server, the datetimeoffset
- which is practially datetime with timezone offset - is currently converted into a parquet string. I don't like that because the data stored there is quite large and might be bad to sort.
Postgres has a similar data type called timestamp with time zone
Unfortunately, PARQUET does not support a logical data type which supports time zone, but there are serveral considerations how to deal with different time zones, see here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp One suggestion is to convert all to UTC.
I would like to have the option to tell odbc2parquet that all datetimeoffset/timestamp with time zone data types shall be converted into a UTC TIMESTAMP in the PARQUET export instead of into a PARQUET string. I suggest to add this option with an additional parameter.
The benefit of this is that filtering or sorting on a PARQUET timestamp is much faster than on a PARQUET string.
Hi, thanks for the great user story! I already try to map timestamps as best as I can. I fully empathise with your usecase. You wouldn't happen to know to which ODBC datatypes datetimeoffset
or timestapm with time zone
map?
I need to test / investigate, but I partly fear that these mit be types which are not part of the ODBC standard, but driver specific extensions. If they are part of the standard, it is most likely just an oversight from me not handling them accordingly. Drivery specific types are more tricky. Their meaning and interpretation is not agreed on cross drivers (well, I guess that is what makes them specific). I would be more hesitant to implement logic for that. Not saying no just now, but it is something to carefully think about.
My bandwith is a bit low at the moment, may be a while before I get to investigate this. Don't know about your ODBC or Rust skills, but if you want I can give you pointers how to drive this story a bit.
Otherwise, just querying the table you likely have with a verbose logging enabled -vvv
and sharing the output with me here, would be nice. Especially everything that describes the column in questios.
If you want to go the extra mile use odbcsv
to yield me the column description.
Cheers, Markus
For SQL Server it seems to be a vendor/driver specific implementation with a type called SQL_SS_TIMESTAMPOFFSET
. See Data Type Support for ODBC Date and Time Improvements
. It passes an extra struct which could be used to cast it:
typedef struct tagSS_TIMESTAMPOFFSET_STRUCT {
SQLSMALLINT year;
SQLUSMALLINT month;
SQLUSMALLINT day;
SQLUSMALLINT hour;
SQLUSMALLINT minute;
SQLUSMALLINT second;
SQLUINTEGER fraction;
SQLSMALLINT timezone_hour;
SQLSMALLINT timezone_minute;
} SQL_SS_TIMESTAMPOFFSET_STRUCT;
I couldn't find an offical documentation for Postgres right away.
I will discuss in the team what the best option is to tackle this issue.
I see, this is why the timestamp logic of odbc2parquet
does not trigger. It is just considered an "other" type, and fetched as string. One way of tackling this could be to detect the database system before analyzing the schema. From the odbc-api
test suite we know Microsoft SQL Server will identify itself as such: https://github.com/pacman82/odbc-api/blob/4721fd050b5813021538591810869e05f3039383/odbc-api/tests/integration.rs#L2841
We could then introduce database specify logic to handle "other" types. Just a thought. I'll sleep over it.
👍
If you could find out what the struct for PostgreSQL would look like, we may be able to support that, too.
I could not find any specific documentation about that. I guess because timestamp with time zone
is quite new in PostgreSQL (compared to the ODBC implementation). Before this type was implemented, the default type takes the local time zone and converts it to UTC when it is stored in the DB and it is translated back when the data is read. The closest I could find is this odbc documentation which notes this logic but does not give explecite information about its logic (unfortunately I couldn't find a deep ODBC documentation on https://www.postgresql.org/docs/)
I guess the best way would be to try this out. I currently don't have the time to build a test setup for PostgreSQL with ODBC ...
Hello @leo-schick , odbc2parquet 0.13.0
has been released. It uses instant semantics, if the database is "Microsoft SQL Server" and the relational type is DATETIMEOFFSET
. I'll leave this issue open to check if supporting PostgreSQL is an easy win. MSSQL is already supported with the current release.
Cheers, Markus
Tests with Postgres have shown, that to an ODBC client both TIMESTAMP
and TIMESTAMPTZ
are just presented as ODBC type TIMESTAMP
. However, Postgres seems to return the timestamp as UTC for the latter. I'll explain it better with more time, but in short there is nothing I can or should implement for Postgres.
@pacman82 There is a way to tell postgres to export the data in a specific timezone, e.g. UTC. The psql
command interprets the environment variable PGTZ
to set the time zone for the client. Maybe something comparable is available for ODBC e.g. as ODBC connection parameter? just an idea....
In my tests it seems that Postgres does convert the timestamp to UTC. It is just that from the client side I have no way of telling wether this is instant semantics or not. So the parquet
logical type even is timestamp, but it will not indicate that it is adjusted to UTC
. Overall the postgres ODBC drivers behaviour is nicer for clients who do not have any special treatment for Postgres. The MSSQL chooses to use a custom type, as such the default behivour of clients is unlikely to be satisfactory, but now that I've special code for it, it is overall nicer.
Any ideas for configuring ODBC are better placed with the maintainers of the PostgreSQL ODBC driver. Yet, as written before. At least to me it seems to already do what you want.
@leo-schick Almost forgot to ask. Does the tool work for you now?