networktocode/diffsync

Refactor parent/child relationship

jamesharr opened this issue · 2 comments

Environment

  • DiffSync version: 1.9.0

Proposed Functionality

Currently in DiffSync, the tree hierarchy has shown a few limitations, most notably that objects can not form a tree using the same type. For example, a user cannot form a hierarchy of Locations (Site -> Building -> Floor -> Room). Additionally, this parent/child relationship must be declared when creating the Models, which might not be entirely necessary.

Adapting to a new tree implementation may resolve several outstanding feature requests, such as #239 and #225, as well as allow for the implementation of #122 and a get_parent() method which does not exist today.

This change proposes:

  • Declare parent/child object relationships when an object is added to the backend instead of declared during type declaration.
  • (Implementation detail) Store parent/child relationships in the back-end only instead of in DiffSyncModel
  • Add a get_parent method
  • Eliminate the need for a call to obj1.add_child(obj2)
  • Support arbitrary object hierarchies such as:
    • Self-referencing models, such as Location trees (Location in a Location in a Location)
    • Objects with an optional parent

Use Case

# Note that MyModelA & MyModelB declarations would not include _children

be1 = MyBackEnd()
a1 = MyModelA(...)
a2 = MyModelA(...)
b3 = MyModelB(...)
b4 = MyModelB(...)

be1.add(a1, parent=None)  # Note that parent=None is the default
be1.add(a2, parent=a1)  # Specify a parent object
be1.add(b3, parent=a2)  # Specify a parent
be1.add(b4, parent=None)  # Another object of the same type, but without a parent

# Getter methods
be1.get_parent(a2)  # Will return a1
be1.get_parent(a1)  # Will return None; though it could also raise an Exception
be1.get_children(a2)  # Will return something that iterable (that happens to only have b3 in it)
be1.get_children(b4)  # Will return an empty iterator/list

# Alternate signatures to support Model + ids style signatures
# IE when you don't have the original object.
be1.add(a2, parent_model=MyModelA, parent_ids={"k": "v", ...})
be1.get_parent(model=MyModelA, ids={"k": "v", ...})
be1.get_children(model=MyModelA, ids={"k": "v", ...})

One thing I haven't figured out entirely is if it makes sense to support moving an object between parents without delete/recreate

  • What would the API look like?
  • What would diffs look like? How would this be represented?
  • How would this change with multiple parents like in #210?
    • get_parent turns into get_parents, but do we keep the original and how does it behave?
    • Most of this feature request assumes only a single parent
  • Does it make sense to add this support in the same feature PR?
  • Does it need to be implemented with this feature or just designed in a way that doesn't inhibit this feature?

Implementation Thoughts

I've been thinking about the best place to store the parent/child relationship information. I think it's actually better to store them purely in the backend (in a dedicated tree structure) vs having the DiffSyncModel be aware of the parent/child relationships. Using this method would keep DiffSyncModel focused on storing data instead of getting involved in part of the tree.

@dataclass
class TreeNode:
    """Internal data structure to represent an object in a tree"""
    object: DiffSyncModel
    parent: None | DiffSyncModel = None
    children: None | Dict[str, Dict[str, DiffSyncModel]] = None  # children[model_name][identifier] = Object

    def add_child(self: Node, child: DiffSyncModel) -> TreeNode:
	"""Add a tree to a node"""
        n = TreeNode(object=child, parent=self)
	self.children = parent.children or {}  # or use defaultdict()
	self.children.setdefault(child._modelname, {})  # or use defaultdict()
	self.children[child._modelname][child.get_identifier()] = n
        return n

# How TreeNode could be represented in a backend
class DiffSync:
    top_level_nodes: Dict[str, Dict[str, TreeNode]]   # top_level_nodes[model][identifier] = TreeNode(...)
    all_nodes: Dict[str, Dict[str, TreeNode]]  # all_nodes[model][identifier] = TreeNode(...)

    # alternative - use a synthetic root node to reduce the number of special-cases for top-level objects
    root_node: TreeNode(object=None)
    all_nodes: Dict[str, Dict[str, TreeNode]]  # all_nodes[model][identifier] = TreeNode(...)

    # The root_node / top_level_nodes would establish only the top-level nodes
    # The all_nodes would allow for get_parent / get_children method to quickly locate an object in the tree,
    # as well as allow DiffSync to enforce key-uniqueness.

I'm going a bit off-topic, but instead of TreeNode storing a reference to a DiffSyncModel, it could instead store a reference to just the object's identifier; This might make it easier for Python's GC to do its job at the expense of having to look up an object by its key.

@dataclass
class TreeNode:
    """Internal data structure to represent an object in a tree"""
    model: str  # Model name
    id: str  # object key
    parent_model: None | str = None  # Parent model
    parent_id: None | str = None  # Parent key
    children: None | Dict[str, Set[str]] = None  # children[model_name] = Set(ids)  # Set of IDs

class DiffSync:
    # Key lookup of all objects in the backend
    all_objects: Dict[str, Dict[str, DiffSyncModel]] # all_objects[model][identifier] = DiffSyncModel()

    # Tree structure
    top_level_nodes: Dict[str, Dict[str, TreeNode]]   # top_level_nodes[model][identifier] = TreeNode(...)
    all_nodes: Dict[str, Dict[str, TreeNode]]  # all_nodes[model][identifier] = TreeNode(...)

Related Issues

  • This may be an appropriate feature for 2.0 #232
  • Would resolve #239, #225
  • Could potentially resolve #122
  • May be affected by #222 and #210

I have a counter proposal for this one. In the past I have worked with multiple complex diffsync implementations and what I have come to notice is that children themselves may be problematic. I have instead opted for an approach where we use a custom Diff class with custom ordering that should be generalizable to where children aren't needed. After all, the only problem they solve is ordering of operations, and if we can do away with their complexity and still solve that problem I think its a win. What do you think @jamesharr?

See #97 for a code example.

I have a counter proposal for this one. In the past I have worked with multiple complex diffsync implementations and what I have come to notice is that children themselves may be problematic. I have instead opted for an approach where we use a custom Diff class with custom ordering that should be generalizable to where children aren't needed. After all, the only problem they solve is ordering of operations, and if we can do away with their complexity and still solve that problem I think its a win. What do you think @jamesharr?

See #97 for a code example.

I think flattening the hierarchy and using a custom diff order is probably what I'm going to end up doing. I think you're right the most problems boil down to an order of operations problem, but not all.

I think native support for an arbitrary hierarchy might allow for smoother error handling (if the parent fails, don't try to process the children) and avoiding duplicate errors (deleting a parent deletes the child), but for what I'm doing that's a "nice to have"

Ty for the suggestion and link. I personally feel like storing the parent/child relationship in the Backend instead of the Model is a good idea though. It keeps the Model focussed on data and not relationships and I think it would simplify the API for adding a child object.