jocmp/capyreader

Missing images in full content mode

Closed this issue · 14 comments

jocmp commented

Background

Capy Reader uses a library called Readability4J that has a few rules to parse the article's full content.

Sometimes those rules fail, which leads to missing images in Capy's full content mode. This is an annoying issue without a single fix-all solution. Every website is different and changes over time, which is part of the beauty and chaos of the web.

If you run into this issue with a feed, please post a link to the feed with an example article in this thread. I'll track these to fix at some point in the future. Thanks!

Feeds

The main image of each article isn't shown in Capy's full content mode for the following feed:
https://mobilesyrup.com/feed/

Article example:
https://mobilesyrup.com/2024/11/28/google-releases-ai-generated-pieces-chess-game/

(I only noticed this today so maybe it used to work?)

HTML of the image:
<img fetchpriority="high" width="1867" height="1046" src="https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg" class="attachment-full size-full wp-post-image" alt="" decoding="async" srcset="https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg 1867w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-300x168.jpg 300w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-1024x574.jpg 1024w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-768x430.jpg 768w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-1536x861.jpg 1536w, https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-417x235.jpg 417w" sizes="(max-width: 1867px) 100vw, 1867px" />
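In case it helps with debugging, here is a minimal Python sketch that splits a `srcset` like the one above into candidates and picks the widest one. Nothing in this thread establishes that `srcset` handling is why the image goes missing; `parse_srcset` and `widest` are hypothetical helper names, only width (`w`) descriptors are handled, and the comma split is naive.

```python
from typing import List, Optional, Tuple


def parse_srcset(srcset: str) -> List[Tuple[str, int]]:
    """Split a srcset value into (url, width) pairs.

    Assumption: candidates are separated by commas and use 'w' descriptors.
    URLs that themselves contain commas are not handled by this sketch.
    """
    candidates = []
    for part in srcset.split(","):
        tokens = part.split()
        if len(tokens) == 2 and tokens[1].endswith("w") and tokens[1][:-1].isdigit():
            candidates.append((tokens[0], int(tokens[1][:-1])))
    return candidates


def widest(srcset: str) -> Optional[str]:
    """Return the URL of the widest candidate, or None if nothing parsed."""
    candidates = parse_srcset(srcset)
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]


# Shortened copy of the srcset from the <img> tag above.
SRCSET = (
    "https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess.jpg 1867w, "
    "https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-300x168.jpg 300w, "
    "https://cdn.mobilesyrup.com/wp-content/uploads/2024/11/gen-ai-chess-1024x574.jpg 1024w"
)

print(widest(SRCSET))
```

For this tag the widest candidate is the same file as `src`, so a parser that only reads `src` should still have a usable URL.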

jocmp commented

@PhilC813 an update. I'm toying around with Mercury Parser again and seeing some potential upsides. Here's a comparison of a Les Versants article.

[Before and after screenshots]

Mobile Syrup

[Before and after screenshots]

Wow, seems very promising.

Do you mind checking with this article?
https://mobilesyrup.com/2024/11/28/here-are-the-2024-staples-black-friday-deals/

It's an article with Black Friday deals, and the current parser basically removes all the bullet points in which the deals are listed 😅

jocmp commented

The new parser skips over lists by default, but with a little bit of code it works: https://github.com/jocmp/capyreader/pull/569/files#diff-a5310ab57bf17835286b2a012ceca522b0f9af190ceeea2dcf80c52f82c6479dR41-R49

So you can easily specify the <ul> tag as an exception, sweet. Frankly, I don't really see why lists would be excluded by default. They are more likely to be content than ads.
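To illustrate the idea only: a toy Python cleaning pass that drops some tags by default and accepts a keep list as an exception. This is not Mercury's API and not the code in the linked PR; `ToyCleaner`, `clean`, `drop_tags`, and `keep_tags` are made-up names, and attributes and entity escaping are ignored.

```python
from html.parser import HTMLParser


class ToyCleaner(HTMLParser):
    """Toy cleaner: removes configured tags (and their contents) unless kept."""

    def __init__(self, drop_tags, keep_tags=()):
        super().__init__(convert_charrefs=True)
        # Tags listed in keep_tags are exempt from the default drop list.
        self.drop = set(drop_tags) - set(keep_tags)
        self.depth = 0  # greater than 0 while inside a dropped element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(f"<{tag}>")  # attributes are omitted in this toy

    def handle_endtag(self, tag):
        if tag in self.drop:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)


def clean(html, drop_tags=("ul", "ol"), keep_tags=()):
    """Return html with dropped tags removed; keep_tags overrides drop_tags."""
    cleaner = ToyCleaner(drop_tags, keep_tags)
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


sample = "<p>Deals:</p><ul><li>Laptop</li></ul>"
print(clean(sample))                     # list removed by default
print(clean(sample, keep_tags=("ul",)))  # list kept via the exception
```

With the default settings the deals list disappears, which is the behavior described for the Black Friday article; passing `ul` as an exception keeps it.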

Also, is there any parser that is still actively maintained? Mercury seems abandoned, like Readability4J. It's not necessarily a problem, but having an active project is always a plus.

jocmp commented

Couldn't agree more. I think Mercury is the more extensible and maintainable of the two. I forked it and I'm working on bringing its dependencies up to date here: https://github.com/jocmp/mercury-parser.

I've updated the app to 2024.12.1080-dev and, despite the reintroduction of Mercury, I'm not seeing the results you shared above in a quick check of the "Les Versants" feed.

Screenrecorder-20241204-011225.mp4

As you can see, in the same article you used for testing, the headline is still missing, and all those grey enclosures further down actually correspond to ad placements. Then there's the last ad of the page that does manage to render.


Also, it seems like the sticky configuration of the "Extract full content" button doesn't work properly in this build.

In an article, if you tap the button to turn it off, tap it again to turn it back on, and then move to another article from the same feed, the setting is off when that article opens.

jocmp commented

Let me take another look. I may be able to filter out those ad placements too. Just to make sure I'm testing the same thing, are you using a local account?

About the sticky config, I'm able to reproduce that bug. I'll follow up with a different ticket to fix that. #576

Just to make sure I'm testing the same thing, are you using a local account?

I'm using Capy with my Feedbin account.

About the sticky config, I'm able to reproduce that bug. I'll follow up with a different ticket to fix that. #576

Don't give up!! 😆

jocmp commented

Aha, I use Feedbin's copy of Mercury Parser for those accounts. Local accounts rely on the Mercury Parser that I'm updating. So they're different right now.

I'll see what I can do to use the same version of the parser everywhere. It should result in a more consistent experience across the board.

jocmp commented

@PhilC813 I enabled the updated Mercury Parser for Feedbin accounts in 2024.12.1081-dev and also fixed the sticky content bug. Let me know how it works for you!

Seeing some extremely positive results so far. I'm also seeing some YouTube videos that were filtered out before now being displayed properly. Solid update..!

Possible to fix articles for this domain?

Seems like all the text and images in their articles are missing/incomplete.

Below are some examples

https://www.hardwarezone.com.sg/feature-how-spot-potential-scam-messages-ios-and-android-singapore-rcs-sms

https://www.phoronix.com/news/Raspberry-Pi-HEVC-H265-Decode

jocmp commented

hey @privacyadmin I'll take a look. Can you open a new issue for each of those feeds using this template? https://github.com/jocmp/capyreader/issues/new?labels=full%20content%20request&template=2-full-content-request.yml

I want to close out this mega-issue since it's hard to track.