img tags with missing src which are set via javascript or noscript show as empty
Pranoy1c opened this issue · 0 comments
The following page:
https://netflixtechblog.com/full-cycle-developers-at-netflix-a08c31f83249
has img tags with an empty or missing src attribute. I think the src is set either via JavaScript on scroll or via the noscript tags that come right after the img tags.
Here's a piece of the page's HTML:
<img alt="" class="iq ir t u v is ak c" width="687" height="60" role="presentation"><noscript><img alt="" class="t u v is ak" src="https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png" width="687" height="60" srcSet="https://miro.medium.com/max/552/1*JnixtUHJjNYXNT15P42eJQ.png 276w, https://miro.medium.com/max/1104/1*JnixtUHJjNYXNT15P42eJQ.png 552w, https://miro.medium.com/max/1280/1*JnixtUHJjNYXNT15P42eJQ.png 640w, https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png 687w" sizes="687px" role="presentation"/></noscript></div></div></div><figcaption class="jd je cm ck cl jf jg en b eo ep fv" data-selectable-paragraph="">SDLC components</figcaption></figure>
As a result, Readability returns the large images with an empty src, and ReadabilityExtended returns only tiny thumbnails.
I am able to solve the issue by searching for all img tags with a missing src, checking whether each such Element has a noscript sibling with an img in it, and if so, extracting the src from the noscript and setting it on the original img.
I placed the following code at the very beginning of the protected open fun removeNoscripts(document: Document) {} function in Preprocessor.kt:
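// Requires: import org.jsoup.Jsoup and import org.jsoup.parser.Parser
try {
    // Select every img whose src is empty or absent.
    document.select("img[src=\"\"], img:not([src])").forEach { img ->
        // Look for a sibling noscript element that holds the fallback img.
        img.siblingElements().select("noscript").firstOrNull()?.let { noscript ->
            // Parse the noscript content and copy the fallback src, if there is one.
            // The null-safe chain skips noscript blocks that contain no img.
            Jsoup.parse(noscript.html(), "", Parser.xmlParser())
                .selectFirst("img")
                ?.attr("src")
                ?.takeIf { it.isNotEmpty() }
                ?.let { src -> img.attr("src", src) }
        }
    }
} catch (e: Exception) {
    println("Exception in setting img for missing src from noscript tags: ${e.message}")
}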
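To try the workaround outside of Preprocessor.kt, here is a minimal standalone sketch. It assumes jsoup is on the classpath. The function name fillMissingSrcFromNoscript is hypothetical and not part of the library, and the HTML is a shortened version of the fragment quoted above.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.parser.Parser

// Hypothetical helper (not part of the library): for each img without a usable
// src, copy the src of the img found inside a sibling noscript element.
fun fillMissingSrcFromNoscript(document: Document) {
    document.select("img[src=\"\"], img:not([src])").forEach { img ->
        // Skip images that have no noscript sibling to fall back on.
        val noscript = img.siblingElements().select("noscript").firstOrNull()
            ?: return@forEach
        // Re-parse the noscript content and read the fallback src, if any.
        val fallbackSrc = Jsoup.parse(noscript.html(), "", Parser.xmlParser())
            .selectFirst("img")
            ?.attr("src")
        if (!fallbackSrc.isNullOrEmpty()) {
            img.attr("src", fallbackSrc)
        }
    }
}

fun main() {
    // Shortened version of the markup from the page in question.
    val html = """
        <figure><div>
          <img alt="" width="687" height="60" role="presentation">
          <noscript><img alt="" src="https://miro.medium.com/max/1374/1*JnixtUHJjNYXNT15P42eJQ.png" width="687" height="60"/></noscript>
        </div><figcaption>SDLC components</figcaption></figure>
    """.trimIndent()

    val document = Jsoup.parse(html)
    fillMissingSrcFromNoscript(document)

    // Print the src of the outer img (the direct child of the div).
    println(document.select("figure > div > img").attr("src"))
}
```

Unlike the snippet above, this version skips an img whose noscript sibling contains no img instead of throwing, so one malformed figure does not stop the remaining images from being processed.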