dankito/Readability4J

None of images in Medium posts are shown

zjamshidi opened this issue · 12 comments

Hi
I'm using this great library to show HTML content in our Android app. It amazes and satisfies us. Big thanks for it. The only problem we have faced is that it detects images in Medium posts as clutter and removes them. Is there any way to prevent this?
You can check the following links as examples:
https://medium.com/storyshots/summary-of-extreme-ownership-by-jocko-willink-and-leif-babin-d161ff9ba347
https://medium.com/storyshots/the-historic-womens-suffrage-march-on-washington-e0d40f1e389b?sk=dfe29d96065cad93bd02f78ad60008ad
Thanks in advance.

Nice to hear that you like it!

Sorry, but I cannot reproduce the issue.

E. g. the second link gives me this result (simply remove the .txt file extension, GitHub doesn't allow uploading .html files):
suffrage-march.html.txt

Which version of Readability4J are you using?

When you check out the source and enter the above URLs as the third parameter in TestDataGenerator.kt -> main(Array) -> TestDataGenerator().generateTestData() (choose any test case name), which output gets generated? Are the images also missing?

Oh. Yeah, I see images in your output. I used TestDataGenerator.kt and it seems to be working.

I will change my implementation and come back soon

I can't believe it. I was using jsoup to extract the raw HTML; I changed it to OkHttp and now it's showing the images! Thank you for mentioning TestDataGenerator.kt, it was a great hint :)

One more question regarding img tags: I know that you have fixed copying the URL from the data-src attribute to display lazy-loading images. Do you have any solution for other lazy attributes, e.g. data-lazy-src, data-delayed-url, and data-li-src?
Currently, I'm using Html.TagHandler to handle them.

I think I found the reason why no images were shown. The key is that you requested the web page on Android, so I guess Medium returned a mobile version of the website.

In TestDataGenerator simply enter this for DefaultUserAgent:
const val DefaultUserAgent = "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/537.36 (KHTML, like Gecko) CChrome/60.0.3112.105 Safari/537.36"

So good that you opened this issue, I definitely have to fix this!

I process the lazy image loading attributes in PostprocessorExtended.fixRelativeImageUri() (I know the method name is misleading).

If you like, I can add the attributes you listed. But I think it will take until Sunday before I find the time to do that and release a new version. Will this be OK for you?
In the meantime you can subclass PostprocessorExtended and pass an instance of it to the Readability4JExtended constructor.
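
Independent of how Readability4J wires this in, the attribute fallback itself can be sketched as a small pure function. This is only an illustration, not the library's implementation: the names LazySourceAttributes and resolveImageSource are hypothetical, and the choice to let a lazy attribute win over an existing src (on the assumption that src often holds a placeholder on lazy-loading pages) is a design guess.

```kotlin
// Attribute names taken from the question above, tried in this order.
// Hypothetical helper; not part of Readability4J.
val LazySourceAttributes = listOf("data-src", "data-lazy-src", "data-delayed-url", "data-li-src")

/**
 * Returns the URL an <img> should display: the first non-blank lazy-loading
 * attribute if one exists, otherwise the non-blank src, otherwise null.
 * Assumption: a lazy attribute is preferred because src may be a placeholder.
 */
fun resolveImageSource(attributes: Map<String, String>): String? {
    val lazySource = LazySourceAttributes
        .mapNotNull { name -> attributes[name]?.trim() }
        .firstOrNull { it.isNotEmpty() }
    return lazySource ?: attributes["src"]?.trim()?.takeIf { it.isNotEmpty() }
}
```

In a subclass of PostprocessorExtended or in an Html.TagHandler, the map would come from the attributes of the current img element, and the returned value would be written back to its src attribute.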

> I think I found the reason why no images were shown. The key is that you requested the web page on Android, so I guess Medium returned a mobile version of the website.

OkHttp works fine even with an empty string as DefaultUserAgent. I'm not sure whether jsoup accepts any user agent.

Regarding lazy image sources, my solution (using Html.TagHandler) works fine now, so we can wait for the next release; there is no rush.

By the way, thanks for your great help, our team appreciates your consideration and quick reply.

If I set the following for DefaultUserAgent, the images aren't shown:
const val DefaultUserAgent = "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/537.36 (KHTML, like Gecko) CChrome/60.0.3112.105 Safari/537.36"

An empty string works fine, by the way.
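
Since the result depends on the User-Agent header, it can help to set that header explicitly when fetching the raw HTML. Below is a minimal sketch using only java.net, so it does not depend on jsoup or OkHttp. The function names and the DesktopUserAgent value are assumptions made for illustration; the thread only establishes that the Android user agent above hid the images and that an empty string worked with OkHttp.

```kotlin
import java.net.HttpURLConnection
import java.net.URI

// Assumption: a desktop-style value picked for illustration, not taken from the thread.
const val DesktopUserAgent =
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.105 Safari/537.36"

/** Creates a connection with an explicit User-Agent header; nothing is sent yet. */
fun openWithUserAgent(url: String, userAgent: String): HttpURLConnection {
    val connection = URI(url).toURL().openConnection() as HttpURLConnection
    connection.setRequestProperty("User-Agent", userAgent)
    connection.connectTimeout = 10_000
    connection.readTimeout = 10_000
    return connection
}

/** Downloads the page body as text, assuming it is UTF-8 encoded. */
fun fetchHtml(url: String, userAgent: String = DesktopUserAgent): String {
    val connection = openWithUserAgent(url, userAgent)
    try {
        return connection.inputStream.bufferedReader(Charsets.UTF_8).use { it.readText() }
    } finally {
        connection.disconnect()
    }
}
```

The returned string can then be handed to Readability4J together with the URL. Comparing the output for different userAgent values is a quick way to check whether a site serves different markup to mobile clients.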

@dankito a new issue with Medium posts has emerged. The images are too small and blurry. I tried different user agent strings but got the same result. Would you please check it again?

One time, after several tries, I got the full-size images, but I don't know how.

Thanks in advance

Sorry for the late answer!

I think the issue is in Medium's HTML structure. Images sometimes look like this:

                        <figure class="in io ip iq ir eh jc it iu paragraph-image">
                            <div class="ix n ew iy">
                                <div class="jd n">
                                    <div class="cx iv fp p q fo ac cg v iw">
                                        <img  src="https://miro.medium.com/max/60/1*0hjrvN9jt7epvkRFovUQOQ.jpeg?q=20"
                                            class="fp p q fo ac ja jb" width="700" height="413"/>
                                    </div>
                                    <img class="cx iv fp p q fo ac" width="700" height="413"/>
                                    <noscript>
                                        <img src="https://miro.medium.com/max/1400/1*0hjrvN9jt7epvkRFovUQOQ.jpeg"
                                                   class="fp p q fo ac" width="700" height="413"/></noscript>
                                </div>
                            </div>
                        </figure>

So there are two issues:

  • Do you see the first <div> <img /> </div>?
    The image from its src (https://miro.medium.com/max/60/1*0hjrvN9jt7epvkRFovUQOQ.jpeg?q=20) has a size of 60 x 35, but its width and height attributes give the <img> element a size of 700 x 413.
    These should be the ones that you described as "too small and blurry".
    But I don't know how to detect these generically; it's a Medium-specific thing, and I think they remove or adjust this via JavaScript.
  • Just after this comes the <img /> <noscript> <img /> </noscript> structure.
    What has to be done here is to unwrap the last <img> from the <noscript> element and remove the first <img> element (the one without a src attribute). I also think Medium does this via JavaScript.
    Here as well, I don't know how to solve this generically. I tried to adjust the <noscript> handling in Preprocessor.removeScripts() and .shouldKeepImageInNoscriptElement(), but then many tests of other websites break.

Would it be a big effort for you to implement custom image handling for Medium websites yourself? You could either:

  • Remove all <img> and <div> elements with the cx and iv classes (but I don't know if this works generically, I only judged from the HTML above),
  • Or remove images with no src and images whose src starts with "https://miro.medium.com/max/60/" (same here, I don't know if this works generically).
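
The second option boils down to a simple decision rule per image, which could be sketched as follows. The names MediumPreviewPrefix and isMediumPlaceholderImage are hypothetical, and the rule is derived only from the single HTML sample above, so it may not hold for all Medium pages.

```kotlin
// Prefix of the low-resolution preview URL seen in the sample above.
// Assumption: Medium uses this prefix consistently; only one sample was inspected.
const val MediumPreviewPrefix = "https://miro.medium.com/max/60/"

/**
 * Returns true for images that should be dropped: those without a usable src
 * and those pointing at the 60px preview rendition.
 */
fun isMediumPlaceholderImage(src: String?): Boolean {
    val trimmed = src?.trim().orEmpty()
    return trimmed.isEmpty() || trimmed.startsWith(MediumPreviewPrefix)
}

fun main() {
    // The three <img> elements from the sample, represented by their src values.
    val sources = listOf(
        "https://miro.medium.com/max/60/1*0hjrvN9jt7epvkRFovUQOQ.jpeg?q=20",
        null,
        "https://miro.medium.com/max/1400/1*0hjrvN9jt7epvkRFovUQOQ.jpeg"
    )
    // Only the full-size image from the <noscript> element remains.
    println(sources.filterNot(::isMediumPlaceholderImage))
}
```

Applied to a parsed document, the predicate would be evaluated on each img element's src before the HTML is passed on. The full-size image inside <noscript> still has to be kept or unwrapped separately, as described above.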

Yes, it seems the issue is the structure of the Medium posts. For now, we will just replace them with other websites. Thanks for your help.

@dankito is help needed on this project? If yes, my email is in my profile.