dankito/Readability4J

Preprocessor.kt has a bug at line 66

swuqi opened this issue · 1 comments

swuqi commented

If a web page includes an image tag like the one in the following block, image.attr("src") returns "", so the line document.select("img[src=${image.attr("src")}]") builds the selector img[src=] and crashes Jsoup's select.

<noscript>
    <div>
        <img data-src="//www.bizographics.com/collect/?fmt=gif&amp;pid=7850&amp;ts=noscript" width="1" height="1" alt="" />
    </div>
</noscript>

Test url: https://www.msn.com/en-us/news/technology/facebook-says-attackers-stole-details-from-29-mln-users/ar-BBOiiJa
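
The crash described above could be avoided by checking the attribute before building the selector. The following is only a minimal sketch of that idea, not the code in Preprocessor.kt: the helper name imgSelectorForSrc is hypothetical, and it assumes that a blank src is the only input that needs to be skipped.

```kotlin
// Hypothetical helper, not part of Readability4J or Jsoup.
// Builds the CSS selector for <img> elements with the given src value,
// or returns null when src is blank so the caller can skip the lookup
// instead of handing the invalid selector "img[src=]" to Jsoup.
fun imgSelectorForSrc(src: String): String? {
    if (src.isBlank()) {
        // The <img> in the report only has data-src, so attr("src") is "".
        return null
    }
    return "img[src=$src]"
}
```

A caller would then run document.select(selector) only when the helper returns a non-null selector, and treat a null result as "no matching image outside the noscript element".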

I have tried to make the fix directly in the main branch. Later my fix was gone. I don't know why.

Another suggestion is to make the document member variable of Readability4J.kt public, so that after the HTML text is parsed by Jsoup, client code can easily manipulate the parsed DOM structure, for example by removing specific elements.

Thank you for your great effort in bringing this code to Java.

Thanks for reporting this bug!

And it was easy to reproduce thanks to your detailed description.

Actually, the example provided violates the HTML specification, as the src attribute is mandatory for <img> elements, but of course I fixed the corresponding line in the Preprocessor class.

I also mentioned you there as the reporter. If you want to have the attribution changed, drop me a line.

I just released version 1.0.2 with the fix; it should be visible on Maven Central in about 2 hours.

> Another suggestion is to make the document member variable of Readability4J.kt public, so that after the HTML text is parsed by Jsoup, client code can easily manipulate the parsed DOM structure, for example by removing specific elements.

Since all that code does is create a Jsoup document from the provided HTML string, this can already be done by using the constructor that takes a document parameter:

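import net.dankito.readability4j.Readability4J
import org.jsoup.Jsoup

val html = ...
val document = Jsoup.parse(html)
// adjust document here to your needs, e.g. remove specific elements
val article = Readability4J(uri, document).parse() // uri: URL of the page the HTML came from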