eve-val/evelink

cache bug with python 3

bastianh opened this issue · 4 comments

I noticed that caching doesn't work correctly on Python 3 and tracked it down to the hash() function used in _cache_key().

The problem is that, beginning with Python 3.3, hash randomization is enabled by default, so with every launch of the app (or with more than one process) the cache key is different even for the same API calls.
https://docs.python.org/3/reference/datamodel.html#object.__hash__
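A minimal sketch of how the effect can be reproduced, assuming CPython 3.3 or later. The helper name `hash_in_fresh_process` and the sample string are made up for illustration; the only real knob used is the `PYTHONHASHSEED` environment variable, which fixes the seed that is otherwise chosen randomly at startup.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text, seed):
    """Return hash(text) as computed by a newly started interpreter.

    `seed` is passed through PYTHONHASHSEED, so each distinct seed stands in
    for a separate launch of the application.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    # Hypothetical API path, used only as sample input.
    sample = "account/Characters"
    print(hash_in_fresh_process(sample, 1))
    print(hash_in_fresh_process(sample, 2))
```

With the same seed the two processes agree; with different seeds (or the default random seed) they generally do not, which is why a key built from hash() is unusable for a cache shared across launches or processes.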

It's possible to change the behavior with an environment variable (PYTHONHASHSEED), but perhaps it's better to use something from hashlib?

Hm. I definitely support moving to another hash option which is deterministic in all Pythons and across invocations; that said, I'm not sure if a hashlib algorithm is the right choice here, since we definitely don't need cryptographic security in this case but instead want a fast hash function.

That said, Python doesn't seem to have any fast hash functions baked in (aside from hash(), which obviously isn't the solution to itself), so perhaps a crypto hash is the best option that doesn't needlessly pull in additional library code. I suppose we're not hashing that many values.

Hmm, maybe adler32? It is part of the zlib module we already use and I think it's quite fast.

https://docs.python.org/2/library/zlib.html?highlight=adler32#zlib.adler32

edit: I compared some functions. While hashlib is slower, I don't think it really matters in our case ;)
I guess we could save much more time if we cached the parsed result of the API call instead of the unparsed XML.

from zlib import adler32
from hashlib import md5, sha1

# Each wrapper takes a bytes input: a plain "..." literal is bytes on Python 2,
# but zlib and hashlib reject str on Python 3, hence the b"..." literals below.

def test_hash(x):
    return hash(x)

def test_adler32(x):
    return adler32(x)

def test_md5(x):
    return md5(x).digest()

def test_sha1(x):
    return sha1(x).digest()


%timeit test_hash(b"Nobody inspects the spammish repetition")
10000000 loops, best of 3: 145 ns per loop
%timeit test_adler32(b"Nobody inspects the spammish repetition")
1000000 loops, best of 3: 335 ns per loop
%timeit test_md5(b"Nobody inspects the spammish repetition")
1000000 loops, best of 3: 1.33 µs per loop
%timeit test_sha1(b"Nobody inspects the spammish repetition")
1000000 loops, best of 3: 1.41 µs per loop
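For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard timeit module. The `CANDIDATES` table and the `benchmark` function are names invented for this sketch, and the numbers it produces will depend on the machine and Python version, so they are not expected to match the figures above.

```python
import timeit
from hashlib import md5, sha1
from zlib import adler32

# Same sample input as in the session above, as bytes so it works on Python 3.
DATA = b"Nobody inspects the spammish repetition"

CANDIDATES = {
    "hash": hash,
    "adler32": adler32,
    "md5": lambda x: md5(x).digest(),
    "sha1": lambda x: sha1(x).digest(),
}


def benchmark(number=100000, repeat=3):
    """Return the best observed per-call time in seconds for each candidate."""
    results = {}
    for name, func in CANDIDATES.items():
        # timeit.repeat returns one total per round; keep the fastest round.
        best = min(timeit.repeat(lambda: func(DATA), number=number, repeat=repeat))
        results[name] = best / number
    return results


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print("%-8s %.0f ns per call" % (name, seconds * 1e9))
```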

Adler32 is a checksum, not a hash... I'm not sure it has the distribution properties you want for dictionary keys.
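To make that concern concrete, here is a small sketch. The two inputs are hand-picked for illustration and do not come from the thread. Adler-32 is built from two running sums, so short inputs with matching sums collide, whereas a hashlib digest keeps them apart. Whether this matters for real cache keys is a separate question that this example does not settle.

```python
from hashlib import md5
from zlib import adler32


def same_adler32(a, b):
    """Return True if two byte strings have the same Adler-32 checksum."""
    return adler32(a) == adler32(b)


def same_md5(a, b):
    """Return True if two byte strings have the same MD5 digest."""
    return md5(a).digest() == md5(b).digest()


# Both inputs have byte sum 294 and the same weighted sum, so the two
# Adler-32 accumulators end up identical.
print(same_adler32(b"ada", b"bbb"))  # True
print(same_md5(b"ada", b"bbb"))      # False
```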
On Jun 18, 2014 1:13 AM, "Bastian Hoyer" notifications@github.com wrote:

> hmm.. maybe adler32 ? it is part of the zlib module we already use and I think it's quite fast.
>
> https://docs.python.org/2/library/zlib.html?highlight=adler32#zlib.adler32



yeah, that's why I'd prefer something from hashlib
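A rough sketch of what a hashlib-based key could look like. This is not the actual evelink implementation: the function name `cache_key`, its `base_url`, `path` and `params` arguments, and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib


def cache_key(base_url, path, params):
    """Build a cache key that stays the same across interpreter launches.

    Hypothetical helper: the argument names are assumptions, not evelink's API.
    """
    # Sort the parameters so that dict ordering does not affect the key.
    sorted_params = sorted((str(k), str(v)) for k, v in params.items())
    raw = repr((base_url, path, sorted_params))
    # SHA-1 is used only as a deterministic fingerprint here, not for security.
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical host, path, and parameters, used only as sample input.
    print(cache_key("api.example.invalid", "char/Info", {"characterID": 1, "keyID": 2}))
```

Because the digest depends only on the bytes fed into it, two processes (or two launches) computing the key for the same call get the same string, which is the property hash() no longer provides on Python 3.3+.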