Rethink / decide / justify on how to implement equality/hashing/less-than for Annotations

Question

Rethink / decide / justify on how to implement equality/hashing/less-than for Annotations

Closed this issue 4 years ago · 4 comments

johann-petrak commented 4 years ago

Currently:

equality: based on all attributes, including the annotation id
hash: based on annotation id and owner set
- different hashes imply non-equal
less than: based on start offset and annotation id
- it is possible: not ann1 < ann2 and not ann1 > ann2 and not ann1 == ann2
- because equality is based on different attributes than order comparison

Answer 1 · 2021-04-09T12:42:21.000Z

Issues:

in some cases, annotations which only differ by id should be seen equal as their "useful" content is identical
- maybe add a separate equality method for this
order and equality are not consistent, but:
- we do not want to use all fields for ordering
- we do not want to only use the start offset and id for equality (why?)
- why do we define order at all: just for convenience so we can easily sort a list of annotations
- NOTE: if we iterate over the annotations from an annotationset, ordering is handled anyways, so it is only for Python collections that contain annotations.
- may not be worth the confusion to actually have this rather than use an explicit lamda for sorting within python collections?

Answer 2 · 2021-04-26T05:57:39.000Z

Reminder: we need to make sure our definitions of equality and hasing satisfy these constraints:

whenever the objects compare equal, the hashes MUST be equal
- whenever the objects compare unequal, the hashes SHOULD be unequal
hash codes MUST NOT be based on mutable values

For annotations, the feature map is mutable. Since we cannot use the feature map for hashing, we cannot use it for equality either. So our definition of equality should be based on something that is more about the identity of the annotation than the content:

an annotation from an attached set is uniquely identified by owning set and id
an annotation without an owning set is uniquely identified only by its id, i.e. a shallow copy would be a different annotation
however, it is impossible for annotations in an attached set to be shallow copies of each other, so id would be sufficient there as well

Conclusion: go back to using id for equality and hashing and implement separate methods for content-based comparisons.

Answer 3 · 2021-04-26T06:04:19.000Z

This also means that in order for Span objects to implement equality based on the offsets, those offsets must be immutable.

Answer 4 · 2021-04-26T19:44:57.000Z

Annotation now uses default Python equality/hashing
Spans are immutable and use start/end for equality and hashing
the less-than implementation is kept as before for convenience