Suggested performance enhancements
Closed this issue · 6 comments
GoogleCodeExporter commented
I've been performance testing DaisyDiff on some large diffs and have made a few
speedups. I'm working on a local fork, but would obviously like to keep it as
close to the official version as possible.
I've attached a patch of what I've done and I'll describe it too:
1) getParentTree was very inefficient. For a node of depth D it would create
D - 1 lists of ancestors of lengths D-1 -> 0, and add each list's contents to
the one bigger than it. I've rewritten it to create a single list and then
reverse it. There are numerous other ways to rewrite it but this one's fairly
simple.
2) The dominating performance problem is that the AncestorComparator does lots
of work. I'm afraid I don't understand how the differ works well enough to
stop it doing so many comparisons, but I've done some work to make the
comparisons faster.
* I've made the Node store a reference to root, so that getting the root node
doesn't require traversing through all the ancestors. It means that whenever a
Node has its parent changed, the root needs to be updated on that node and all
its children. I've added that to setParent and from my limited testing this
seems to be a significant net-win.
* I've written a method to compare an AttributeMap's content with an Attributes
object, which cuts down on the amount of calls to constructing an AttributeMap.
I can see the advantage of having one AttributeMap to look up attributes by
name rather than index, but one or other set of Attributes still needs to be
iterated over, so there's no need to have two of them.
* I've added an IdentityMap of Attributes -> Boolean in TagNode. This prevents
the need to retest whether the TagNode's Attributes are equal to another
TagNode's Attributes. This happens a *lot* apparently, and I suspect there's a
better fix to be made to prevent the checks being performed at all.
3) the DomTreeBuilder was calling inPre a lot, which was a fairly
time-consuming thing to do since each inPre requires a traversal through to the
pre tag or the root node. I've made the DomTreeBuilder simply track whether or
not it is within a pre-tag rather than delegating that work to the current node
and its tree.
I'm wondering whether one of the committers has an existing set of test data to
use for performance testing, as I'd love to make sure I haven't made any known
pathalogical cases worse instead of better. On the comparison one of our users
encountered these changes reduce the time spent diffing by about 25%.
I tried setting up an IdentityHashMap to track equality tests of the whole
TagNode instead of the AttributeMaps, and this was slightly more effective but
since TagNodes can theoretically change their parents and thus the outcome of
equals, this seemed unsafe.
Original issue reported on code.google.com by don.jp.w...@gmail.com
on 5 May 2011 at 6:49
Attachments:
GoogleCodeExporter commented
As with all other patches, it would be good if this was peer reviewed by at
least one other person. I have some diff tests around that I can try, but it
would be great if people applied this patch locally and run it against their
own datasets.
Original comment by kkape...@gmail.com
on 6 May 2011 at 8:02
- Changed state: Accepted
- Added labels: ****
- Removed labels: ****
GoogleCodeExporter commented
According to my 1000+ test cases this patch
A. Does not break anything.
B. Offers no noticeable speed regressions.
My tests run about 1-2% percent faster which is close to the statistical error.
Has anybody else run any benchmarks? I will perform some more detailed tests
but unless there is an objection, I think this can go safely to trunk.
Original comment by kkape...@gmail.com
on 10 May 2011 at 7:15
- Changed state: Started
- Added labels: Type-Enhancement
- Removed labels: Type-Defect
GoogleCodeExporter commented
Hmmm... I just noticed that all my tests use the DaisyDiff "tag" mode, while
the patch is against the "html" mode. This is why the behavior is essentially
unchanged.
I will see if I can convert them to "html" mode and run them again.
Original comment by kkape...@gmail.com
on 10 May 2011 at 7:31
- Added labels: ****
- Removed labels: ****
GoogleCodeExporter commented
The fixes are definitely specific to html mode. They also may make little
difference to documents with very shallow trees.
Original comment by don.jp.w...@gmail.com
on 16 May 2011 at 4:40
- Added labels: ****
- Removed labels: ****
GoogleCodeExporter commented
I don't seem to have big enough tests. Those that I have tried have 5%-20%
speedup. They are mostly html tables.
Anyway I think this can go to trunk. Anybody has a different view on this patch?
Original comment by kkape...@gmail.com
on 19 May 2011 at 8:30
- Added labels: ****
- Removed labels: ****
GoogleCodeExporter commented
Applied in r150
Original comment by kkape...@gmail.com
on 28 May 2011 at 6:57
- Changed state: Fixed
- Added labels: ****
- Removed labels: ****