Limit the depth of HTML parsing to avoid out-of-memory situations

Question

GoogleCodeExporter opened this issue 10 years ago · 1 comments

GoogleCodeExporter commented 10 years ago

What steps will reproduce the problem?

(Using version 1.2.0.)
1. Parse the HTML at "http://worldwidescience.org/topicpages/s.html" with boilerpipe.
ArticleExtractor is sufficient to demonstrate the problem.

Even with 8 GB of JVM memory, this results in an out-of-memory error.

Attached is a patch that allows limiting the number of TextBlocks
created/appended by boilerpipe. Once that limit is reached, boilerpipe
ignores all further content from the parsed input.
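
For readers without the attachment, the idea of capping the number of blocks can be sketched roughly as follows. This is not the patch itself and does not use boilerpipe's classes: `BoundedBlockCollector`, `maxBlocks`, and `isLimitReached` are hypothetical names chosen for illustration, and plain strings stand in for TextBlocks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical illustration only: collects text blocks up to a fixed cap
 * and silently drops everything offered after the cap is reached.
 */
final class BoundedBlockCollector {
    private final int maxBlocks;
    private final List<String> blocks = new ArrayList<>();
    private boolean limitReached = false;

    BoundedBlockCollector(int maxBlocks) {
        if (maxBlocks < 0) {
            throw new IllegalArgumentException("maxBlocks must be >= 0");
        }
        this.maxBlocks = maxBlocks;
    }

    /** Stores the block and returns true, or returns false if the cap was already hit. */
    boolean add(String blockText) {
        if (blocks.size() >= maxBlocks) {
            // Remember that input was truncated so callers can report it.
            limitReached = true;
            return false;
        }
        blocks.add(blockText);
        return true;
    }

    /** True once at least one block has been dropped. */
    boolean isLimitReached() {
        return limitReached;
    }

    /** Read-only view of the blocks that were kept. */
    List<String> blocks() {
        return Collections.unmodifiableList(blocks);
    }

    public static void main(String[] args) {
        BoundedBlockCollector collector = new BoundedBlockCollector(2);
        for (String s : new String[] {"a", "b", "c", "d"}) {
            collector.add(s);
        }
        System.out.println(collector.blocks() + " limitReached=" + collector.isLimitReached());
    }
}
```

In a real parser the `add` call would sit wherever a new block is appended, so memory use stays bounded by the cap regardless of how large the input page is. The trade-off is that content after the cap is lost, which is why the sketch exposes a flag to detect truncation.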

Original issue reported on code.google.com by mstr...@gmail.com on 25 Nov 2013 at 4:29

Attachments:

boilerpipe-core.patch.tar.gz

Answer 1 · 2015-03-24T10:53:46.000Z

Please change the issue type to "enhancement".

Original comment by mstr...@gmail.com on 26 Nov 2013 at 8:13