internal sharing of RangeDict values can cause debugging headaches... fix or clarify in documentation?
jgm-ktg opened this issue · 4 comments
from ranges import Range, RangeDict
from operator import ior
d=RangeDict()
d[Range(1,2)]=set([3])
d[Range(3,4)]=set([3])
ior(d[Range(1,2)],set([4])) # e.g. as when underlying the use of the |= operator
print(d)
...produces...
{{[1, 2), [3, 4)}: {3, 4}}
...which is clearly not expected behavior. By returning values that are shared with other ranges internally, operators like |= become sources of bugs that are tough to discover.
How can we either indicate that code like the following is sometimes necessary or else change the module behavior to make it unnecessary? I assume there is some good reason that objects are compacted/shared internally?
from ranges import Range, RangeDict
from copy import copy
oldgi=RangeDict.__getitem__
def newgi(this,key):
return copy(oldgi(this,key))
RangeDict.__getitem__ = newgi
d=RangeDict()
d[Range(1,2)]=set([3])
d[Range(3,4)]=set([3])
d[Range(1,2)] |= set([4])
print(d)
{{[3, 4)}: {3}, {[1, 2)}: {3, 4}}
This is a bug, thanks for reporting it.
The main reason it's happening is that, to improve performance and simplify deletion, RangeDict
correspondences are linked in both directions (i.e. there's a mapping of values to keys in there along with the expected mapping of keys to values). So if multiple keys share the same value, they get grouped together into a single RangeSet
which then becomes a single key.
The direct cause, then, is probably that the comparison internally is using ==
rather than is
, which causes this to happen to set
values in particular (though it would happen to int
values either way, so that behavior is still wrong and needs to be fixed). The underlying cause, though, probably has to do with __setattribute__()
not splitting a range if only part of it is assigned to a new value. I'll look into it.
So if multiple keys share the same value, they get grouped together into a single RangeSet which then becomes a single key.
I think this may be appropriate, it just leads to some unintuitive behavior. I almost wonder whether there needs to be an option to return copies of internally shared objects rather than the objects themselves, and perhaps this could be the default? Then someone seeking higher performance who understands the nuances could turn it off in RangeDict init?
EDIT: another idea would be to allow the user to decide in init whether to use "is" or "==" ... I would lean towards giving the user as many options as possible, and making sure the default ones yield the fewest functional surprises even if it slows things down
d=RangeDict()
d[Range(1,2)]={'a':set([1]),'b':set([2])}
d[Range(3,4)]={'a':set([1]),'b':set([2])}
print(d)
d[Range(2,3)]={'a':set([1]),'b':set([2])}
print(d)
{{[1, 2), [3, 4)}: {'a': {1}, 'b': {2}}}
{{[1, 4)}: {'a': {1}, 'b': {2}}}
To expand on the above, I would not want to lose the option of compression based on ==
Added an optional kwarg identity
to the RangeDict
constructor which can toggle this behavior on or off (using identity instead of equality to determine when two keys are pointing to the same value). This should resolve the issue.