Superbird11/ranges

internal sharing of RangeDict values can cause debugging headaches... fix or clarify in documentation?

Closed this issue · 4 comments

from ranges import Range, RangeDict
from operator import ior
d=RangeDict()
d[Range(1,2)]=set([3])
d[Range(3,4)]=set([3])
ior(d[Range(1,2)],set([4])) # e.g. as when underlying the use of the |= operator
print(d)

...produces...

{{[1, 2), [3, 4)}: {3, 4}}

...which is clearly not expected behavior. By returning values that are shared with other ranges internally, operators like |= become sources of bugs that are tough to discover.

How can we either indicate that code like the following is sometimes necessary or else change the module behavior to make it unnecessary? I assume there is some good reason that objects are compacted/shared internally?

from ranges import Range, RangeDict
from copy import copy

oldgi=RangeDict.__getitem__
def newgi(this,key):
    return copy(oldgi(this,key))
RangeDict.__getitem__ = newgi

d=RangeDict()
d[Range(1,2)]=set([3])
d[Range(3,4)]=set([3])
d[Range(1,2)] |= set([4])
print(d)
{{[3, 4)}: {3}, {[1, 2)}: {3, 4}}

This is a bug, thanks for reporting it.

The main reason it's happening is that, to improve performance and simplify deletion, RangeDict correspondences are linked in both directions (i.e. there's a mapping of values to keys in there along with the expected mapping of keys to values). So if multiple keys share the same value, they get grouped together into a single RangeSet which then becomes a single key.

The direct cause, then, is probably that the comparison internally is using == rather than is, which causes this to happen to set values in particular (though it would happen to int values either way, so that behavior is still wrong and needs to be fixed). The underlying cause, though, probably has to do with __setattribute__() not splitting a range if only part of it is assigned to a new value. I'll look into it.

So if multiple keys share the same value, they get grouped together into a single RangeSet which then becomes a single key.

I think this may be appropriate, it just leads to some unintuitive behavior. I almost wonder whether there needs to be an option to return copies of internally shared objects rather than the objects themselves, and perhaps this could be the default? Then someone seeking higher performance who understands the nuances could turn it off in RangeDict init?

EDIT: another idea would be to allow the user to decide in init whether to use "is" or "==" ... I would lean towards giving the user as many options as possible, and making sure the default ones yield the fewest functional surprises even if it slows things down

d=RangeDict()
d[Range(1,2)]={'a':set([1]),'b':set([2])}
d[Range(3,4)]={'a':set([1]),'b':set([2])}
print(d)
d[Range(2,3)]={'a':set([1]),'b':set([2])}
print(d)
{{[1, 2), [3, 4)}: {'a': {1}, 'b': {2}}}
{{[1, 4)}: {'a': {1}, 'b': {2}}}

To expand on the above, I would not want to lose the option of compression based on ==

Added an optional kwarg identity to the RangeDict constructor which can toggle this behavior on or off (using identity instead of equality to determine when two keys are pointing to the same value). This should resolve the issue.