GateNLP/python-gatenlp

Rethink / decide / justify on how to implement equality/hashing/less-than for Annotations

Closed this issue · 4 comments

Currently:

  • equality: based on all attributes, including the annotation id
  • hash: based on annotation id and owner set
    • different hashes imply non-equal
  • less than: based on start offset and annotation id
    • it is possible: not ann1 < ann2 and not ann1 > ann2 and not ann1 == ann2
    • because equality is based on different attributes than order comparison

Issues:

  • in some cases, annotations which only differ by id should be seen equal as their "useful" content is identical
    • maybe add a separate equality method for this
  • order and equality are not consistent, but:
    • we do not want to use all fields for ordering
    • we do not want to only use the start offset and id for equality (why?)
    • why do we define order at all: just for convenience so we can easily sort a list of annotations
    • NOTE: if we iterate over the annotations from an annotationset, ordering is handled anyways, so it is only for Python collections that contain annotations.
    • may not be worth the confusion to actually have this rather than use an explicit lamda for sorting within python collections?

Reminder: we need to make sure our definitions of equality and hasing satisfy these constraints:

  • whenever the objects compare equal, the hashes MUST be equal
    • whenever the objects compare unequal, the hashes SHOULD be unequal
  • hash codes MUST NOT be based on mutable values

For annotations, the feature map is mutable. Since we cannot use the feature map for hashing, we cannot use it for equality either. So our definition of equality should be based on something that is more about the identity of the annotation than the content:

  • an annotation from an attached set is uniquely identified by owning set and id
  • an annotation without an owning set is uniquely identified only by its id, i.e. a shallow copy would be a different annotation
  • however, it is impossible for annotations in an attached set to be shallow copies of each other, so id would be sufficient there as well

Conclusion: go back to using id for equality and hashing and implement separate methods for content-based comparisons.

This also means that in order for Span objects to implement equality based on the offsets, those offsets must be immutable.

  • Annotation now uses default Python equality/hashing
  • Spans are immutable and use start/end for equality and hashing
  • the less-than implementation is kept as before for convenience