k-bx/boilerpipe

Limit HTML parsing depth to avoid out-of-memory errors

GoogleCodeExporter opened this issue · 1 comment

What steps will reproduce the problem?

(using version 1.2.0)
1. Parse the HTML at "http://worldwidescience.org/topicpages/s.html".
ArticleExtractor is sufficient to demonstrate the problem.

Even with 8 GB of JVM memory, this results in an out-of-memory error.

Attached is a patch that allows limiting the number of TextBlocks
created/appended by boilerpipe. Once that limit is reached, boilerpipe
ignores all further content from the parsed input.
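
To illustrate the idea (this is not the attached patch, and none of these names exist in boilerpipe), here is a minimal, self-contained sketch of a collector that stores text blocks up to a fixed limit and silently drops everything after it. The class `BoundedBlockCollector`, its `maxBlocks` parameter, and the use of plain strings in place of TextBlocks are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper, NOT part of boilerpipe: keeps at most maxBlocks
 * text blocks and ignores any further input once the limit is reached.
 */
final class BoundedBlockCollector {
    private final int maxBlocks;
    private final List<String> blocks = new ArrayList<>();
    private boolean limitReached = false;

    BoundedBlockCollector(int maxBlocks) {
        if (maxBlocks < 0) {
            throw new IllegalArgumentException("maxBlocks must be >= 0");
        }
        this.maxBlocks = maxBlocks;
    }

    /** Returns true if the block was stored, false if it was dropped. */
    boolean offer(String blockText) {
        if (blocks.size() >= maxBlocks) {
            // Limit reached: remember that input was truncated and drop the block.
            limitReached = true;
            return false;
        }
        blocks.add(blockText);
        return true;
    }

    /** True once at least one block has been dropped. */
    boolean isLimitReached() {
        return limitReached;
    }

    /** Read-only view of the blocks that were kept. */
    List<String> getBlocks() {
        return Collections.unmodifiableList(blocks);
    }

    public static void main(String[] args) {
        BoundedBlockCollector collector = new BoundedBlockCollector(2);
        for (String text : new String[] {"first", "second", "third", "fourth"}) {
            collector.offer(text);
        }
        System.out.println(collector.getBlocks());      // prints [first, second]
        System.out.println(collector.isLimitReached()); // prints true
    }
}
```

Dropping blocks rather than throwing keeps memory bounded while still returning whatever was extracted so far. A caller that needs to know whether the input was truncated can check `isLimitReached()`.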

Original issue reported on code.google.com by mstr...@gmail.com on 25 Nov 2013 at 4:29

Please change type to "enhancement"

Original comment by mstr...@gmail.com on 26 Nov 2013 at 8:13