dankito/Readability4J

img tags with missing src which are set via javascript or noscript show as empty

Pranoy1c opened this issue · 0 comments

The following page:

https://netflixtechblog.com/full-cycle-developers-at-netflix-a08c31f83249

has img tags with no src attribute. I think the src is either set via JavaScript on scroll or supplied by the noscript tag that directly follows each img tag.

Here's a piece of the page's HTML:

<img alt="" class="iq ir t u v is ak c" width="687" height="60" role="presentation"><noscript><img alt="" class="t u v is ak" src="https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png" width="687" height="60" srcSet="https://miro.medium.com/max/552/1*JnixtUHJjNYXNT15P42eJQ.png 276w, https://miro.medium.com/max/1104/1*JnixtUHJjNYXNT15P42eJQ.png 552w, https://miro.medium.com/max/1280/1*JnixtUHJjNYXNT15P42eJQ.png 640w, https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png 687w" sizes="687px" role="presentation"/></noscript></div></div></div><figcaption class="jd je cm ck cl jf jg en b eo ep fv" data-selectable-paragraph="">SDLC components</figcaption></figure>

This causes Readability to return empty images for the large images, and only tiny thumbnails when using ReadabilityExtended.

I was able to solve the issue by finding all img tags with a missing src and checking whether each such Element has a noscript sibling with an img in it. If it does, I extract the src from the img inside the noscript and set it on the original img.

I placed the following code at the very beginning of the protected open fun removeNoscripts(document: Document) function in Preprocessor.kt:

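// Requires these imports in Preprocessor.kt: org.jsoup.Jsoup, org.jsoup.parser.Parser
try {
    // Match img tags whose src attribute is empty or absent
    document.select("img[src=\"\"], img:not([src])").forEach { img ->
        // Take the first noscript sibling, which holds the fallback img
        img.siblingElements().select("noscript").firstOrNull()?.let { noscript ->
            // Re-parse the noscript content and copy the fallback src only if one exists,
            // so a noscript without an img no longer throws and aborts the whole loop
            Jsoup.parse(noscript.html(), "", Parser.xmlParser())
                .selectFirst("img")
                ?.attr("src")
                ?.takeIf { it.isNotBlank() }
                ?.let { src -> img.attr("src", src) }
        }
    }
} catch (e: Exception) {
    println("Exception in setting img for missing src from noscript tags: ${e.message}")
}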
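
For anyone who wants to try the idea outside Readability4J, here is a minimal standalone sketch of the same workaround. It is not the library's code. The function name restoreImagesFromNoscript and the example.com URLs are made up for illustration. It differs from the snippet above in two ways: it only looks at the noscript element directly after the img, and it also copies srcset when the fallback has one. It assumes jsoup is on the classpath.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.parser.Parser

/**
 * Hypothetical helper (not part of Readability4J): for every img without a usable src,
 * copies src (and srcset, if present) from the img inside the noscript element that
 * directly follows it. Returns the number of images that were repaired.
 */
fun restoreImagesFromNoscript(document: Document): Int {
    var repaired = 0
    // Filter in Kotlin so that an absent src and a blank src are treated alike
    document.select("img").filter { it.attr("src").isBlank() }.forEach { img ->
        // Only accept a noscript that is the next element sibling of the placeholder img
        val noscript = img.nextElementSibling()
            ?.takeIf { it.tagName().equals("noscript", ignoreCase = true) }
            ?: return@forEach
        // Re-parse the noscript content, as in the snippet above, and look for the fallback img
        val fallback = Jsoup.parse(noscript.html(), "", Parser.xmlParser()).selectFirst("img")
            ?: return@forEach
        val src = fallback.attr("src")
        if (src.isBlank()) return@forEach
        img.attr("src", src)
        // Copy srcset too when the fallback provides one
        fallback.attr("srcset").takeIf { it.isNotBlank() }?.let { img.attr("srcset", it) }
        repaired++
    }
    return repaired
}

fun main() {
    // Reduced version of the markup from the page; URLs are placeholders
    val html = """
        <figure>
          <div><img alt="" width="687" height="60"><noscript><img alt="" src="https://example.com/large.png" srcset="https://example.com/small.png 276w" width="687" height="60"/></noscript></div>
          <figcaption>SDLC components</figcaption>
        </figure>
    """.trimIndent()
    val document = Jsoup.parse(html)
    val repaired = restoreImagesFromNoscript(document)
    println("Repaired $repaired image(s)")
    println(document.selectFirst("figure img")?.attr("src"))
}
```

Restricting the lookup to the next element sibling avoids picking up an unrelated noscript elsewhere in the same parent, at the cost of missing pages that put other elements between the img and its noscript.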