aws-samples/amazon-textract-response-parser

List of relationship.ids should likely be a set.

cmyers009 opened this issue · 1 comments

The list of relationship.ids contain no duplicates.
On the line relationship.ids.extend(x for x in ids if x not in relationship.ids), you must iterate through the entire list to make sure there are no duplicates before you append the item to the list. If you use python's built in set datatype, you can accomplish the same behavior with a o(1) time per item adding to the list instead of o(n) with n=length of relationship.ids.

https://github.com/aws-samples/amazon-textract-response-parser/blob/dd1ce01d5c63b394af26510d7df72d58e80d136c/src-python/trp/trp2.py#LL354C85-L354C85

def add_ids_to_relationships(self, ids: List[str], relationships_type: str = "CHILD"):
    """Only adds id if not already existing"""
    relationship = self.get_relationships_for_type(relationship_type=relationships_type)
    if relationship:
        if not relationship.ids:
            relationship.ids = list()
            relationship.ids.extend(ids)
        else:
            relationship.ids.extend(x for x in ids if x not in relationship.ids)
    else:
        # empty, set base
        if not self.relationships:
            self.relationships = list()
        self.relationships.append(TRelationship(type=relationships_type, ids=ids))

Hi & thanks for raising this,

Although relationships should contain no duplicate IDs, the ordering is generally important and so the built-in set type (which does not guarantee order) wouldn't really work for us.

Alternatives I'm aware of include:

  • Using dict with only keys (also risky territory - I think a lot of people would like dict not to guarantee order either)
  • Taking an extra dependency (which we're reluctant to do as we manage to be very light right now)
  • Building something custom like this

...all of which would be non-standard in some way and therefore potentially confusing to users. Since .relationships is an exposed property in the API, changing its type would likely also constitute a breaking change.

As a result, I don't think we see sufficient pressure on the performance to move ahead with such a change at this time.

However, if folks are having real challenges with performance in a use-case, and can demonstrate that this is a significant blocking factor, then maybe we could revisit.