StackOverflowError when page includes another <body> part in <noframes>
GoogleCodeExporter commented
What steps will reproduce the problem?
- ArticleExtractor cannot process a web page that has two <body> parts (like the
attached page) and fails with a "java.lang.StackOverflowError".
What is the expected output? What do you see instead?
- The "noframes" part is for browsers that do not support frames, so boilerpipe
should not take this part into consideration.
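As a stopgap on the caller's side, the "noframes" content could be dropped before the HTML is handed to the extractor. Below is a minimal sketch, assuming the page is already available as a String. The class and method names (NoframesStripper, stripNoframes) are hypothetical and not part of boilerpipe, and a regular expression is only a rough approximation that may miss badly malformed markup.

```java
import java.util.regex.Pattern;

public class NoframesStripper {
    // Matches a <noframes> element together with its content.
    // CASE_INSENSITIVE covers <NOFRAMES>; DOTALL lets ".*?" span line breaks.
    // The quantifier is reluctant so that separate blocks are matched one by one.
    private static final Pattern NOFRAMES = Pattern.compile(
            "<noframes\\b[^>]*>.*?</noframes\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the input with every complete <noframes>...</noframes> block removed.
    // Hypothetical helper for illustration; an unclosed <noframes> is left untouched.
    public static String stripNoframes(String html) {
        return NOFRAMES.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><frameset><frame src=\"a.html\"></frameset>"
                + "<noframes><body>Your browser does not support frames.</body></noframes>"
                + "</html>";
        System.out.println(stripNoframes(html));
    }
}
```

The cleaned string could then be passed on, for example to ArticleExtractor.INSTANCE.getText(String), so that the parser never sees the second <body>.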
What version of the product are you using? On what operating system?
- boilerpipe 1.2.0 on Linux/Windows
Original issue reported on code.google.com by gural.vu...@gmail.com
on 14 May 2012 at 2:56
Attachments:
GoogleCodeExporter commented
Thanks for reporting.
This seems to be caused by a bug in NekoHTML 1.9.13.
The corresponding stack trace points at
"org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1003)".
The problem seems to go away after an update to NekoHTML 1.9.15.
Could you please confirm this?
Before upgrading boilerpipe to NekoHTML 1.9.15, I will have to perform some
extra checks, especially to ensure we don't get any regressions in terms of
extraction quality.
Best,
Christian
Original comment by ckkohl79
on 14 May 2012 at 4:44
- Changed state: Started
- Added labels: OpSys-All
GoogleCodeExporter commented
Thanks for the quick response.
As you've stated, the problem has gone away with NekoHTML 1.9.15.
Below is the list of changes in NekoHTML since version 1.9.13 (which was
released on 2 Sept 2009):
- Version 1.9.15 (3 Aug 2011)
Avoid using a synchronized structure (here java.util.Properties) to store built-in entities that are loaded at startup (#3001745), change INS to inline element, change BUTTON to inline element, don't parse body of IFRAME, add new feature http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe to allow empty IFRAME tags (default is false), make detected encoding available as Locator2.getEncoding() (#3381270).
- Version 1.9.14 (2 Feb 2010)
Don't parse body of NOFRAMES (fixes StackOverflowError reported in #2854697), TABLE can have multiple THEAD, TBODY and TFOOT (patch provided by Ahmed Ashour, #2893796), trim encoding found in meta tag (#2904817), fix ArrayIndexOutOfBoundsException on empty attribute when using feature normalize-attrs (#2838901), recognize tags even if the > of the opening tag is missing (#2886227), only end TABLE can close a table (#2913095), fix StackOverflowError when parsing document fragment (#2911449), fix NullPointerException occurring with the insert-namespaces feature (#2942363).
I'm not entirely sure, but I guess these changes do not affect boilerpipe's
extraction quality.
Looking forward to hearing about the result of your regression tests.
Regards,
Gural
Original comment by gural.vu...@gmail.com
on 14 May 2012 at 7:16