eafer/rdrview

rdrview does not extract titles


Hi! Thanks for rdrview.

I found that, on some websites, it does not extract titles.
For example, this article looks normal in Firefox reader view:
screenshot-24-06-25-18-52-21

but with rdrview, there are no titles, only paragraphs:
screenshot-24-06-25-18-53-02

On other websites, it sometimes displays subtitles normally but not the main title.

I use rdrview built from the latest commit with gcc on Alpine Linux x86_64.

If you have any idea why this happens, I would be happy to know.

The problem is that the page you linked uses h1 tags for its section titles. rdrview expects h1 to be used only for the main title, so those headings get removed. It seems that Firefox used to have this issue too, but it was fixed a few years ago: mozilla/readability@11093f011f57fa528a0. So I need to port that patch to rdrview, but it's not trivial because it uses a Unicode regex.
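
To illustrate the behaviour described above, here is a minimal Python sketch. It is not rdrview's code (rdrview is written in C) and it does not reproduce the linked Readability commit. The block model and the names `drop_all_h1` and `demote_h1` are hypothetical, and the exact-match title comparison is an assumption that stands in for whatever matching the real patch does. The first function shows why section titles vanish when every h1 is treated as the main title; the second shows one possible alternative, which keeps such headings by demoting them to h2.

```python
from typing import List, Tuple

# Hypothetical simplified model of a page: a list of (tag, text) blocks.
Block = Tuple[str, str]


def drop_all_h1(blocks: List[Block]) -> List[Block]:
    """Mimic the reported behaviour: every h1 is assumed to be the
    main title and is removed from the article body."""
    return [(tag, text) for tag, text in blocks if tag != "h1"]


def demote_h1(blocks: List[Block], title: str) -> List[Block]:
    """One possible alternative (an assumption, not the actual patch):
    drop an h1 only when it repeats the main title, and turn every
    other h1 into an h2 so that section titles survive."""
    wanted = title.strip().casefold()
    result: List[Block] = []
    for tag, text in blocks:
        if tag == "h1":
            if text.strip().casefold() == wanted:
                continue  # duplicate of the main title, shown separately
            tag = "h2"
        result.append((tag, text))
    return result


# A page that uses h1 for its section titles, as in the report.
page: List[Block] = [
    ("h1", "Intro"),
    ("p", "First paragraph."),
    ("h1", "Details"),
    ("p", "Second paragraph."),
]
```

With this input, `drop_all_h1(page)` leaves only the two paragraphs, which matches the second screenshot, while `demote_h1(page, "My Article")` keeps "Intro" and "Details" as h2 headings.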

OK, thanks for the explanation. I'll wait.