Incomplete extraction of text with special characters
GoogleCodeExporter opened this issue · 0 comments
Hello Boilerpipe,
When ArticleExtractor.INSTANCE.getText(url) is called for a web page that
contains source code (as in the example below), it does not return the whole text.
The expected text [1] is what the web version of boilerpipe extracts.
The result is the same with versions 1.1.0 and 1.2.0. How can I get the library
to extract the complete text, as the web version of boilerpipe does?
Steps to reproduce the problem:
1. url = http://supplesoftware.wordpress.com/2009/07/01/make-sure-you-get-all-your-messages-in-your-scala-code/
2. ArticleExtractor.INSTANCE.getText(url)
3. Returned incomplete text:
"Make sure you get all your messages in your Scala code 01Jul09 I had this
funny little Scala actor related bug today. Imagine you want to process a few
things in parallel. So you go: val processor = self jobs.foreach { job =>
actor { processor ! (job.id, job.run) } } // Merge results for (i <- 1 to
jobs.size) { self.receiveWithin(1000) { case (jobId:Int,
result:JobResult) => mergeResult(result) } } All good. Then you realise that
one or more of the jobs may fail with an exception. Which you have to handle
somehow. So, you think, you’ll break on the first exception and report back.
So you change that to: val processor = self jobs.foreach { job => actor {
try { processor ! (job.id, job.run) } catch { case ex:Throwable
=> processor ! (job.id, ex, job) } } } // Merge results for (i <- 1 to
jobs.size) { self.receiveWithin(1000) { case (jobId:Int,
result:JobResult) => mergeResult(result) case(jobId:Int, ex:Throwable,
job:Job) => throw new RuntimeException("Job " + jobId + " failed", ex) } }
Cool. You fail on the first one – you just blow up and report back to your
caller that something went wrong. Great! Well, not so…. Because the other
jobs you started are still going to send you their results. You’ve stopped
that thread by throwing an exception, so there’s nothing to receive those
messages. When the web-server (in this case) reuses that thread, it will be
sent all those messages. It won’t actually receive them until it hits the
receiveWithin methond, so when you are expecting the return from the freshly
started actors, you will actually be getting the messages from the actors that
broke while servicing the last request. Kind of undesirable really. One thing
you can do is wait for all the messages. Take a response back regardless of
what it is. This is what I did. Here’s the example: for (rowNumber <- 1 to
mainRowsCount) { self.receiveWithin(2000) { "
[1] http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fsupplesoftware.wordpress.com%2F2009%2F07%2F01%2Fmake-sure-you-get-all-your-messages-in-your-scala-code%2F&extractor=ArticleExtractor&output=text&extractImages=
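To compare against the web output for other pages, the reference URL in [1] can be rebuilt from the page URL. This is a minimal sketch that assumes the endpoint and query parameters are exactly those shown in [1]. The class and method names (WebExtractUrl, build) are only illustrative and are not part of boilerpipe.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of boilerpipe: rebuilds the web-demo URL from [1].
class WebExtractUrl {
    // Endpoint and parameter names are copied from reference [1] (assumed unchanged).
    static final String ENDPOINT = "http://boilerpipe-web.appspot.com/extract";

    static String build(String pageUrl) {
        // Percent-encode the page URL so it can be passed as a query parameter value.
        String encoded = URLEncoder.encode(pageUrl, StandardCharsets.UTF_8);
        return ENDPOINT + "?url=" + encoded
                + "&extractor=ArticleExtractor&output=text&extractImages=";
    }

    public static void main(String[] args) {
        System.out.println(build(
            "http://supplesoftware.wordpress.com/2009/07/01/"
            + "make-sure-you-get-all-your-messages-in-your-scala-code/"));
    }
}
```

The text served at the resulting URL can then be compared with the string returned by ArticleExtractor.INSTANCE.getText(url) to see where the library output stops.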
Thank you,
Alexandre Cançado Cardoso
Original issue reported on code.google.com by acc.intr...@gmail.com
on 24 Sep 2013 at 8:28