mogui/pyorient

Problem with encoding dates in serialization.py

Opened this issue · 13 comments

I think the accepted standard of OrientDB is to write and read dates in POSIX/Unix time which means they should be encoded and decoded as if they occurred at UTC midnight. If you use the OrientDB console:

CREATE CLASS Event EXTENDS V  
CREATE VERTEX Event SET name='Example', date=Date('1970-01-01 00:00:00')
SELECT @RID, name, date.asLong() FROM Event

You get as expected 0L for the long value stored.

In serializations.py dates are encoded and decoded using local time. This means if the client using pyorient reading a date field is in a more western timezone than who wrote the value using pyorient, they get a different date. (i.e. If you encode 1970-01-01 with current serialization.py in New York it actually writes, 18000000 to the DB. If you then decode that with pyorient in California as a date, you get 1969-12-31). Note in the process pyorient discards the time portion of the timestamp, meaning the dates are truly different.

I can see why this may evade tests, as if the test process that is writing data and the test process reading the data are in the same timezone they will return the same value. (even if still the value stored in the DB is technically wrong). Any tests on dates should be updated to check what the long value stored in the DB is, i.e. Writing 1970-01-01 should store 0L.

Both the encoding and the decoding need to be fixed:

Decoding (lines 322:324 of serializations.py) should change from:

        if c == 'a':
            collected = date.fromtimestamp(float(collected) / 1000)
            content = content[1:]

to

        if c == 'a':
            collected = datetime.utcfromtimestamp(float(collected) / 1000).date()
            content = content[1:]

and encoding (lines 131:132 of serializations.py) should change from:

        elif isinstance(value, date):
            ret = str(int(time.mktime(value.timetuple())) * 1000) + 'a'

to

        elif isinstance(value, date):
            ret = str(int(calendar.timegm(value.timetuple())) * 1000) + 'a'

For the later change, you'll also need to import calendar. calendar.timegm() assumes the date is UTC, time.mktime() assumes its local time. They return 0 (correct) and 18000000 (incorrect for Orient) respectively.

I had the same problem with datetime. I followed your instructions also for 't' (datetime) and not only for 'a' (date):
after
def _encode_value(self, value):
add
from calendar import timegm
and then change
ret = str(int(time.mktime(value.timetuple())) * 1000) + 't'
to
ret = str(int(timegm(value.timetuple())) * 1000) + 't'
and change
collected = datetime.fromtimestamp(float(collected) / 1000)
to
collected = datetime.utcfromtimestamp(float(collected) / 1000)

Hi, i studied your suggestions and i found that these fix doesn't works in all cases.

For example, when a sql string is used instead of an Date object:

CREATE CLASS dt EXTENDS V
CREATE PROPERTY dt.name String
CREATE PROPERTY dt.at Date
CREATE VERTEX dt SET name="today",at="2016-04-10"

The date is stored in Long format inside OrientDB as Date representation, but when it is decoded with this patch by the client it becomes "2016-04-09". Both forms ( String with property Date and the Date Object ) are sent to the client in UnixTime format so is not possible to distinguish between them.

So, i think that the responsibility of writing and decoding the dates in the right Timezone and format, must be delegated to the applications .

In the same manner, if i think to other famous relational databases ( MySQL/Oracle ) i see that they are not aware of the data they store.

Sorry, i mark this as won't fix.

"i think that the responsibility of writing and decoding the dates in the right Timezone and format, must be delegated to the applications ." I 100% agree, but pyorient doesn't let you do that. It forces both the read and write client to write and decode using the local timezone.(by using time.mktime and date.fromtimestamp ) And currently that means it is impossible to use pyorient to store dates if you might have clients reading that are west of you. All dates will be wrong. Or do you have a suggestion on how to use pyorient to do this? The only way I can think of is very hacky, give every date a separate timezone property for where it was written that also has to be read and if you are reading it further west you increment the date returned by pyorient by one.

Also quick question, did you try your test without my patch (i.e. normal pyorient). I think it also fails if the reader is west of the writer.

Surely the datetime management could be improved, but the fix is not so simple because of different date handling between strings and date objects.

I think that your results are not good because the date=Date('1970-01-01 00:00:00') is evaluated at server side and calculated in the timezone of the server. When you retrieve it, the offset changed at your side because of the pyorient.

I suggest you to encode the dates at client side and insert them as string.

As response to your question, yes i tried with your patch and this test fails:
https://github.com/mogui/pyorient/blob/master/tests/test_ogm.py#L418

Any suggestion is welcome

kived commented

I'm having a problem with this too, and I have to say that the pyorient behavior is incorrect. First, OrientDB can be schemaless or only have partial schema. Considering that dates are commonly found in various types of data, and those types of data may not have schema in your database, this creates a vast difference in handling (more than just the obvious difference of converting a string to a datetime object). In my case, I have a schema with a datetime field, and a data field containing additional JSON. The datetime field is converted to local time, while the nested field remains UTC.

Second, OrientDB's own Studio client does not have this problem. If I run a query in OrientDB Studio, I'll get:

{'start_time': '2016-08-12 21:16:54',
 'data': {'STD': '2016-08-12T21:18:29',
          'STA': '2016-08-12T21:33:29'}}

..while with pyorient I get:

{'start_time': '2016-08-12T14:16:54',
 'data': {'STD': '2016-08-12T21:18:29',
          'STA': '2016-08-12T21:33:29'}}
kived commented

FYI - as a workaround, since my application is an API and all datetimes returned to the client should be UTC anyway, I just set that as the default at the top of my database Python file:

from os import environ
from time import tzset

environ['TZ'] = 'UTC'
tzset()

This works on OS X, and should work on Linux (or any *nix). Might not work on Windows. Has other effects, of course - like logging timestamps are in UTC - but should be tolerable.

There are two issues here:

  1. Round-trip of Python date(time) objects
  2. String-encoded date(time)s

The CSV serializer can handle round-trips of date(time) objects, so long as encode and decode assume the same time zone. Whether this timezone is local or UTC doesn't matter, no matter what the server timezone is.

pyorient_native wasn't handling round-trips (nikulukani/pyorient_native#3), as encode and decode assumed different time zones (local and UTC, respectively).

With string-encoded date(time)s, the server timezone matters. OrientDB's default server timezone is the local time-zone.

We could run into problems with string-encoded date-times in a few ways. Two that are relevant here:

  1. The client is in a different time-zone from the server
  2. The serializers use UTC, and the server time-zone is not GMT/UTC/Zulu

The first cause is inevitable with multiple clients in varied time zones. Using UTC for encode/decode, the second cause, was proposed as a solution for the first.

Ideally, we want to allow a scenario where pyorient clients and an OrientDB server can happily coexist in multiple time-zones. To support this scenario with both Python date(time) objects and string-encoded date(time)s, two conditions must be met:

  1. The serializers use UTC
  2. The server time-zone is GMT/UTC/Zulu

That the CSV serializer uses the local time zone is a nice default, since it will get by with OrientDB's default. But this is not tenable with real-world scenarios.

The server time zone is configurable (ALTER DATABASE TIMEZONE "<tzname>"). The pyorient serializer(s) assumed time-zone should also be configurable (I suggest defaulting to local, for both CSV and binary serializers). I think I'll work on this for the CSV serializer, so long as @Ostico is happy with this. Otherwise we're left with the messy choice this issue highlights:

  • Using the local time-zone: works on a small scale but inevitably broken for likely real-world use
  • Using UTC: unlikely to work out-of-the-box, but at least any problems can be avoided through server config.

By having this be configurable, we also get away from encouraging workarounds like that proposed by @kived. It generally being good practice to avoid unpredictable global state.

The CSV serializer seems to fail with the roundtrip encoding of dates (atleast with OrientDB 2.2.5). I will write a script that tests both roundtrip (with possibly different timezones) and string encoded dates to get a better idea of the behavior.

It seems using UTC for serializing/deserializing seems to work for all scenarios. Check the output https://goo.gl/c1LuzG generated by the script https://goo.gl/ViLtk8

The script was ran against pyorient and pyorient_native develop branches. The OrientDB server (2.2.5) had a UTC+4 timezone setting.

@nikulukani datetime.date.today() is not sufficient for testing round trips, since it doesn't include time.

I am just testing the encoding and decoding of dates with that script. Datetimes are considered a different data type by OrientDB and they are handled correctly by both CSV and Binary serializers, in my opinion.

Just went through some code on the OrientDB side to see how they handle dates and realized that the serialization/deserialization behavior for Binary and CSV is different on the server side. The server behavior for date objects when using binary serialization essentially resolves this issue (lines 426, 717 and 998-1012 from
https://github.com/orientechnologies/orientdb/blob/2.2.x/core/src/main/java/com/orientechnologies/orient/core/serialization/serializer/record/binary/ORecordSerializerBinaryV0.java#L426)

For CSV serialization/deserialization, the OrientDB server does not perform the conversion from/to <date> 00:00:00 local time to/from <date> 00:00:00 UTC. (lines 638-645 and 474 from https://github.com/orientechnologies/orientdb/blob/7850712aafb3cb7c61a5c2865710019df0a7e8c9/core/src/main/java/com/orientechnologies/orient/core/serialization/serializer/record/string/ORecordSerializerStringAbstract.java#L638).

The latter is what's causing the issue reported by OP and reflected in the results I posted. Moreover, the server sets the time to 00:00:00 during deserialization after converting the UNIX time sent over to the database time zone. This behavior is honestly baffling to me and very problematic. This essentially can result in +/- 1 or 2 days difference in encoded/decoded values when the server is not in the same time zone as the client. If clients (applications) do know the database timezone, the best way to avoid problems is to send over the date as UNIX time corresponding to <date> 00:00:00 <database time zone> and decode it using the same scheme when using CSV serialization.

Given the above, to mitigate this issue, I agree with the proposal to have the option of specifying a non-default encoding and decoding time zone for Dates when using CSV serialization. For DateTime, I cannot foresee a situation where it would be useful, but having the option can't hurt!

Finally, on a related but unrelated note, it seems that OrientDB stores dates as UNIX time corresponding to <date> 00:00:00 <DB timezone>. I am not sure if they process all dates in the database when the database timezone is changed with ALTER database <timezone>, but I would venture that they do not. This means that the date values stored in the DB (or at least their interpretation) will change if the database timezone is changed after a record was created. As such, I would suggest caution with using ALTER database <timezone> command if you have records with date fields already in OrientDB.