evolve75/RubyTree

Unique node names

sirsean opened this issue · 11 comments

Pull request #9 changed the behavior of node name uniqueness, and this went out in release 0.9.3.

In the discussion on that pull request (#9 (comment)), it was pointed out "there is existing code which is using duplicate node names already, and this fix will break that code", which is true and broke my code which was using version 0.8.3.

From what I can tell, the reasoning behind the fix was twofold:

  • The documentation states that the name must be unique within the entire tree
  • The name is used implicitly as an "id" across the whole tree

I propose that a tree that looks like this should be perfectly valid:

* root
|---+ one
|    +---> deep
+---+ two
    +---> deep

The fact that there are two nodes with the name "deep" in this tree is irrelevant. They are not the same node, they just happen to have the same name.

Conceptually, when you retrieve nodes out of the tree, you do it like so:

root["one"]["deep"]
root["two"]["deep"]

If we expected all node names to be unique and used as a key within the entire tree, then we would expect this to work:

root["deep"]

But it doesn't, and it shouldn't.

It's worth noting that a node's name is used as an implicit id in two ways:

  • As the key in @children_hash, which implies that the name is unique only among its siblings
  • As the comparator in #<=>, which would then sort all the nodes in the tree by their name alone, which seems wrong

Rather than fundamentally changing the behavior of the tree to match the documentation, I think the following changes should be made:

  • Fix the documentation such that node names must be unique among their siblings, not among the entire tree
  • Change Tree::TreeNode#<=> so that the comparison is not based only on the node's name, but on a combination of the node's parent (or parentage array, if you prefer) and the node's name

Thoughts?

+1

@sirsean,

Thanks for highlighting this issue. Mea culpa. I had myself put in the comment about the possibility of code already using duplicates.

I have pushed out a fix with a test case (0.9.3pre, f7f6377) into the master branch now, which seems to be handling the scenario correctly.

  • For the two actions you mentioned, the first fix (documentation, and allowing duplicates as long as they are not siblings) is taken care of.

This check does need to be configurable though, as there might be scenarios (ref #9) where a truly unique node set is required in the tree. Any ideas appreciated. I was thinking about a add_nodups method, but it will not allow the convenience of the << operation. Perhaps a decorator, or a class-level configurator of some sort …

  • The second item (changing Tree::TreeNode#<=>) need more discussion and feedback. The main intent of having the spaceship operator in RubyTree is to provide compatibility to the Comparable interface. Also, there might be scenarios where detached, individual nodes are being compared before addition to a tree, where the current implementation makes sense.

While it should be possible to implement the comparison within the tree's context (i.e., taking into account the parent node, parentage/edges, etc.), it might make the system too restrictive (similar to what happened here).

The Enumerable.sort does return nodes based on the node name right now, which seems to be a good default, and in any case, Enumerable.sort_by can be used for any other ordering.

Any thoughts on the above?

@evolve75 Thanks for jumping on this so quickly.

I'm not convinced it does need to be configurable. PR #9 did not actually claim to have a use case for globally unique node names, he just said he was making the behavior consistent with the documentation.

In fact, I posit that configurable behavior is incorrect: the node names should not be globally unique. A node in a tree is defined by the combination of its name and its position in the tree: ie, its "path". "root->one->deep" simply has nothing to do with "root->two->deep" despite the same leaf node name.

Tree::TreeNode#<=> is the only place where #name is actually used as a unique identifier across the entire tree. So, for the sake of conceptual integrity, that should be fixed; while true that the reason to implement #<=> is to adhere to the Comparable interface, the intent of Comparable is to actually compare the nodes. The node name does not identify a node, so simply comparing on the name is an incomplete and inadequate comparison. As stated above, a node is defined by its path; as such, to correctly implement Comparable you should compare the entire path, not just the name.

And there is no scenario in which comparing an entire node that is not in a tree to a node that already is in a tree makes sense. Before a node has been added to the tree, it is not "equal" to a node in the tree that happens to have the same name. "root->one->deep" is not equal to "root->two->deep" OR to "someothertree->deep" OR to "deep".

And finally, I would say that having Enumerable#sort return nodes sorted based on the node name is actually a pretty crappy default. A "good" default would be, say, the same order as either #each or #breadth_each.

For example, say you have the tree we're talking about:

* root
|---+ one
|    +---> deep
+---+ two
    +---> deep

And you call #sort on it. The resulting [deep, deep, one, root, two] is completely meaningless. I, and I think anyone who thinks they're dealing with a tree data structure, would expect it to be either [root, one, deep, two, deep] or [root, one, two, deep, deep].

@sirsean,

Thanks for the detailed feedback. I have released the rubytree-0.9.3pre gem. Can you please check if it addresses the issue with globally unique names ([sudo] gem install rubytree --pre)?

For the <=> operator, your points are valid, but we will need to think through on what other implications would this change result in.

Also, this will necessarily have a particular traversal mechanism imposed on the sort, and we will also need to traverse the tree (at least for the edges and parentage involved for the two nodes) to address arbitrary node compares.

Any thoughts?

I just fired up my old code which is using siblings with same name, is there any way to reenable this again?

So "root->one" is not the same as "root->one->one"

Could we get this "feature" back?

@igorkova,

I have released R0.9.3 which fixes this issue. The node names are no longer required to be globally unique. They will need to be unique between siblings though.

Please update to the latest version and let me know if the behavior is as expected.

@sirsean,

The new release takes care of the issue globally unique node names. I am working out the comparison operator logic for the nodes (<=>), and will update it in a subsequent release. Given the breakage the unique node _mis-_feature might have caused to existing code, I wanted to get this fixed quickly.