HTML Diff produces an OOM Exception
GoogleCodeExporter opened this issue · 5 comments
GoogleCodeExporter commented
What steps will reproduce the problem?
1. Run a diff using HTMLDiffer on the attached files.
2. HTML diff will produce an OOM Exception
3. Runtime Exception is thrown
What is the expected output? What do you see instead?
An accurate diff report is expected, or at least something to preempt the
OOM Exception.
What version of the product are you using? On what operating system?
Version 1.1, Linux Redhat 5 64 bit, JDK 6, 2GB heap size.
Please provide any additional information below.
We've got a rich text editor that allows users to paste HTML directly into
an HTML tab. The user in this case pasted HTML into the rich text tab as
opposed to the HTML tab, which allows for direct HTML input. The result is
that the HTML is escaped in one version (file2.txt) and not escaped in the
next (file1.txt). The text is the same in both versions, but the diff
contains too many differing elements and fails with the following
exception:
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.eclipse.compare.rangedifferencer.OldDifferencer.findDifferences(Unknown Source)
    at org.eclipse.compare.rangedifferencer.RangeDifferencer.findDifferences(Unknown Source)
    at org.eclipse.compare.rangedifferencer.RangeDifferencer.findDifferences(Unknown Source)
    at org.outerj.daisy.diff.html.HTMLDiffer.diff(Unknown Source)
Original issue reported on code.google.com by mccullough.todd
on 17 Mar 2010 at 4:14
GoogleCodeExporter commented
Confirmed on DaisyDiff 1.0 and 1.1. While the input files are themselves big,
it could also be a memory leak. Note, however, that the error is raised in
Eclipse code and not in DaisyDiff. I do not know the Eclipse differ well
enough to look into this (if indeed this is the problem).
Original comment by kkape...@gmail.com
on 17 Mar 2010 at 5:14
- Changed state: Accepted
GoogleCodeExporter commented
Diffing is quadratic in the size of the documents. In DaisyDiff, the size is
the number of words. The escaped document looks to have very many words, and
it doesn't surprise me that it's intractable to diff these documents.
Of course, this doesn't prove that there isn't a memory leak.
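To put a rough number on the quadratic cost, here is a minimal sketch that
estimates the memory a naive word-by-word comparison table would need. It
rests on assumptions that do not come from DaisyDiff or the Eclipse differ:
words are split on whitespace (not DaisyDiff's real tokenizer), and the table
holds one 4-byte int per pair of words. The class and method names are
hypothetical.

```java
class DiffCostEstimate {
    // Counts whitespace-separated tokens. This is only a stand-in for
    // however DaisyDiff actually splits a document into words.
    static long countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    // Bytes needed for a full (n+1) x (m+1) table of 4-byte ints, which is
    // what a textbook LCS-style comparison would allocate (an assumption,
    // not a description of the Eclipse differ's internals).
    static long estimateTableBytes(long leftWords, long rightWords) {
        return (leftWords + 1) * (rightWords + 1) * 4L;
    }

    public static void main(String[] args) {
        // Two hypothetical documents of 30,000 words each.
        long bytes = estimateTableBytes(30000L, 30000L);
        System.out.println("Estimated table size in bytes: " + bytes);
    }
}
```

Under these assumptions, two documents of about 30,000 words each would
already need more than the 2GB heap mentioned in the report.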
Original comment by guy...@gmail.com
on 17 Mar 2010 at 5:50
GoogleCodeExporter commented
Thanks for the insight. In the short term, I've been trying to determine a way
to opt out of running a diff altogether if certain conditions are met.
Figuring out the conditions is the hard part. It doesn't surprise me that this
fails either, but it doesn't seem to be the size of the HTML at all, just the
number of differing elements. "Normal" diffs between versions of the HTML
that don't have the HTML escape characters work nicely, regardless of the
size of the doc.
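One possible shape for such an opt-out check is sketched below. It is only a
heuristic under stated assumptions: words are split on whitespace, the two
thresholds are hypothetical values that would need tuning against the real
heap, and none of the names come from DaisyDiff. It skips the diff only when
the documents are both large and mostly made of different words, which
matches the observation that large but similar documents diff fine.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DiffGuard {
    // Hypothetical limits, not taken from DaisyDiff; tune for the actual heap.
    static final long MAX_WORD_PRODUCT = 50000000L;
    static final double MAX_UNSHARED_RATIO = 0.8;

    // Splits on whitespace; a stand-in for DaisyDiff's real tokenizer.
    static String[] words(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Fraction of distinct words that appear in only one of the two documents.
    static double unsharedRatio(String[] left, String[] right) {
        Set<String> leftSet = new HashSet<String>(Arrays.asList(left));
        Set<String> rightSet = new HashSet<String>(Arrays.asList(right));
        Set<String> union = new HashSet<String>(leftSet);
        union.addAll(rightSet);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<String>(leftSet);
        shared.retainAll(rightSet);
        return 1.0 - (double) shared.size() / union.size();
    }

    // True when the pair is both large and mostly different, i.e. the case
    // that looks likely to exhaust the heap.
    static boolean shouldSkipDiff(String oldHtml, String newHtml) {
        String[] left = words(oldHtml);
        String[] right = words(newHtml);
        long product = (long) left.length * (long) right.length;
        return product > MAX_WORD_PRODUCT
                && unsharedRatio(left, right) > MAX_UNSHARED_RATIO;
    }
}
```

A caller would run `DiffGuard.shouldSkipDiff(oldHtml, newHtml)` before
invoking the differ and fall back to a "too different to compare" message
when it returns true. Both passes are linear in the number of words, so the
check itself should stay cheap.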
Original comment by mccullough.todd
on 17 Mar 2010 at 6:18