malthe/pq

Hardwired json.dumps introduces double encoding problem

Closed this issue · 7 comments

When calling put() on a Queue, there is a nested call to two different dumps functions:

https://github.com/malthe/pq/blob/master/pq/__init__.py#L226

The problematic expression is dumps(self.dumps(data)). The outer dumps is the stdlib json.dumps and the inner dumps is either the default lambda that does nothing, or a lambda that pickles the object.
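
For reference, a rough paraphrase of those two inner dumps variants (the names and exact form here are mine, not pq's source):

import pickle

# pass-through default: data is assumed to be JSON-serializable already
plain_dumps = lambda data: data

# pickling variant: protocol 0 output, decoded to a latin-1 string
pickle_dumps = lambda data: pickle.dumps(data, 0).decode('latin-1')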

There is a more Pythonic approach than this double-dumping. A common pattern in Python is to provide a json.JSONEncoder subclass that can encode, for example, datetime objects, or even pickled objects for that matter. Currently, a datetime must be encoded by hand before being passed to put. Because the outer json.dumps is hardwired to the stdlib function, it cannot be overridden with a different JSONEncoder subclass.
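
To illustrate the pattern (a minimal sketch; DateTimeEncoder is a hypothetical name, not part of pq):

from datetime import datetime
from json import JSONEncoder


class DateTimeEncoder(JSONEncoder):

    def default(self, obj):
        # serialize datetimes as ISO 8601 strings; defer everything
        # else to the stock encoder
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)

Passing cls=DateTimeEncoder to json.dumps then handles datetimes transparently.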

A cleaner implementation would be to remove the double dumps and simply support passing in JSONEncoder subclasses. The default json.JSONEncoder would exactly mimic the current default, and a subclass, say PickleJSONEncoder, would implement an encode(obj) method that does exactly what the pickling lambdas do now. With this approach, I or anyone else could pass in a custom JSONEncoder suited to our needs.

Here is an example PickleEncoder that exactly mimics pq's current output:

import pickle
from json import dumps, JSONEncoder


class PickleEncoder(JSONEncoder):

    def encode(self, obj):
        # pickle with protocol 0 (ASCII-friendly), then JSON-escape the
        # resulting string -- the same payload pq produces today
        return dumps(pickle.dumps(obj, 0).decode('latin-1'))

The idea with the current design is that you can subclass Queue and implement your own loads and dumps methods or even just pass callables directly to the PQ constructor using the loads and dumps keyword arguments.
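
A minimal sketch of the constructor route, assuming a psycopg2 connection (the lambdas mirror the pickling behavior described above):

import pickle

import psycopg2
from pq import PQ

conn = psycopg2.connect('dbname=example')  # hypothetical connection settings

pq = PQ(
    conn,
    dumps=lambda obj: pickle.dumps(obj, 0).decode('latin-1'),
    loads=lambda data: pickle.loads(data.encode('latin-1')),
)
queue = pq['tasks']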

The built-in json module has hook mechanisms (default for encoding, object_hook for decoding) that can be used to handle a datetime object, but you might want to use a more optimized JSON module as well.
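
For example, a round-trip using those hooks might look like this (the '__datetime__' tag is an arbitrary convention invented for this sketch):

from datetime import datetime
from json import dumps, loads


def encode_default(obj):
    # tag datetimes so the decoder can recognize them
    if isinstance(obj, datetime):
        return {'__datetime__': obj.isoformat()}
    raise TypeError(repr(obj))


def decode_hook(d):
    if '__datetime__' in d:
        return datetime.fromisoformat(d['__datetime__'])
    return d


payload = dumps({'at': datetime(2015, 1, 1)}, default=encode_default)
restored = loads(payload, object_hook=decode_hook)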

The "double encoding" option is really a feature. If you have complex objects that you don't want to implement custom JSON object hooks for, then you can just use the pickle format and not have to do more work. In terms of performance then it is two-fold: we're using a slower pickle type (version 0) and we're JSON-escaping the output string. If you need better performance then you must override the loads and dumps callables and provide custom serialization for your objects.

There is no way to override the outer dumps function in the expression dumps(self.dumps(data)) on line 226 of __init__.py. Yes, you can override the inner dumps, but not the outer, so the only option is to have the inner dumps return a string, which the outer dumps then reserializes as one big JSON string. This makes the payload pretty much useless on the SQL side of things: a JSON string that contains escaped JSON. This is the double encoding. Every payload must be serialized and deserialized twice, and the payload is effectively inaccessible from SQL.
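
The effect is easy to demonstrate with the stdlib alone:

from json import dumps

data = {'id': 42}
inner = dumps(data)   # '{"id": 42}'
outer = dumps(inner)  # '"{\\"id\\": 42}"' -- a JSON string, not a JSON object
# the stored value is escaped JSON inside a JSON string, so Postgres
# operators like data->>'id' cannot reach the fields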

The PR I submitted removes this double encoding. It's more efficient, since the payload is only encoded once, and because it's actual JSON, not JSON escaped into a big JSON string, the payload is usable from SQL using Postgres's default JSON operators.
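
For instance, with single-encoded payloads a query like the following becomes possible (this sketch assumes pq's default table name queue and its data and q_name columns; adjust to your schema):

import psycopg2

conn = psycopg2.connect('dbname=example')  # hypothetical connection settings
cursor = conn.cursor()

# JSON operators apply directly to the payload column
cursor.execute(
    "SELECT data->>'status' FROM queue WHERE q_name = %s",
    ('tasks',),
)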

I have taken the liberty of taking a different stab at the issue in the referenced pull request. This also keeps the database driver from doing the JSON decoding by itself.

OK, I can work with that. It does sidestep the double encoding issue and allows the data to be SQL-accessible, so it works for me. Thanks!

Great. I'll merge the request. It's not that your proposed fix wouldn't work; I just think it introduced more code than necessary.

This feature is now available in release 1.8.1.