CADbloke/daisydiff

Suggested performance enhancements

Closed this issue · 6 comments

I've been performance testing DaisyDiff on some large diffs and have made a few 
speedups.  I'm working on a local fork, but would obviously like to keep it as 
close to the official version as possible.

I've attached a patch of what I've done and I'll describe it too:
1)  getParentTree was very inefficient.  For a node of depth D it would create 
D - 1 lists of ancestors of lengths D-1 -> 0, and add each list's contents to 
the one bigger than it.  I've rewritten it to create a single list and then 
reverse it.  There are numerous other ways to rewrite it but this one's fairly 
simple.

2)  The dominating performance problem is that the AncestorComparator does lots 
of work.  I'm afraid I don't understand how the differ works well enough to 
stop it doing so many comparisons, but I've done some work to make the 
comparisons faster.
* I've made the Node store a reference to root, so that getting the root node 
doesn't require traversing through all the ancestors.  It means that whenever a 
Node has its parent changed, the root needs to be updated on that node and all 
its children.  I've added that to setParent and from my limited testing this 
seems to be a significant net-win.  
* I've written a method to compare an AttributeMap's content with an Attributes 
object, which cuts down on the amount of calls to constructing an AttributeMap. 
 I can see the advantage of having one AttributeMap to look up attributes by 
name rather than index, but one or other set of Attributes still needs to be 
iterated over, so there's no need to have two of them.
* I've added an IdentityMap of Attributes -> Boolean in TagNode.  This prevents 
the need to retest whether the TagNode's Attributes are equal to another 
TagNode's Attributes.  This happens a *lot* apparently, and I suspect there's a 
better fix to be made to prevent the checks being performed at all.

3) the DomTreeBuilder was calling inPre a lot, which was a fairly 
time-consuming thing to do since each inPre requires a traversal through to the 
pre tag or the root node.  I've made the DomTreeBuilder simply track whether or 
not it is within a pre-tag rather than delegating that work to the current node 
and its tree.

I'm wondering whether one of the committers has an existing set of test data to 
use for performance testing, as I'd love to make sure I haven't made any known 
pathalogical cases worse instead of better.  On the comparison one of our users 
encountered these changes reduce the time spent diffing by about 25%.

I tried setting up an IdentityHashMap to track equality tests of the whole 
TagNode instead of the AttributeMaps, and this was slightly more effective but 
since TagNodes can theoretically change their parents and thus the outcome of 
equals, this seemed unsafe.

Original issue reported on code.google.com by don.jp.w...@gmail.com on 5 May 2011 at 6:49

Attachments:

As with all other patches, it would be good if this was peer reviewed by at 
least one other person. I have some diff tests around that I can try, but it 
would be great if people applied this patch locally and run it against their 
own datasets.

Original comment by kkape...@gmail.com on 6 May 2011 at 8:02

  • Changed state: Accepted
  • Added labels: ****
  • Removed labels: ****
According to my 1000+ test cases this patch

A. Does not break anything.
B. Offers no noticeable speed regressions.

My tests run about 1-2% percent faster which is close to the statistical error.

Has anybody else run any benchmarks? I will perform some more detailed tests 
but unless there is an objection, I think this can go safely to trunk.

Original comment by kkape...@gmail.com on 10 May 2011 at 7:15

  • Changed state: Started
  • Added labels: Type-Enhancement
  • Removed labels: Type-Defect
Hmmm... I just noticed that all my tests use the DaisyDiff "tag" mode, while 
the patch is against the "html" mode. This is why the behavior is essentially 
unchanged.

I will see if I can convert them to "html" mode and run them again.

Original comment by kkape...@gmail.com on 10 May 2011 at 7:31

  • Added labels: ****
  • Removed labels: ****
The fixes are definitely specific to html mode.  They also may make little 
difference to documents with very shallow trees.

Original comment by don.jp.w...@gmail.com on 16 May 2011 at 4:40

  • Added labels: ****
  • Removed labels: ****
I don't seem to have big enough tests. Those that I have tried have 5%-20% 
speedup. They are mostly html tables.

Anyway I think this can go to trunk. Anybody has a different view on this patch?

Original comment by kkape...@gmail.com on 19 May 2011 at 8:30

  • Added labels: ****
  • Removed labels: ****
Applied in r150

Original comment by kkape...@gmail.com on 28 May 2011 at 6:57

  • Changed state: Fixed
  • Added labels: ****
  • Removed labels: ****