[Feature Request] Parsing HTML5 Pages

Question

sjehuda opened this issue 2 years ago · 7 comments

It appears that the only feed reader that handles <article/> tags is Liferea, by Mr. Lars Windolf (@lwindolf).

Introduction:
Subscribing To Html5 Websites That Have No Feed

First commit:
Add support for subscribing to HTML5 websites without RSS/Atom feeds by extracting article titles, links and descriptions

Last commit to date:
Improve HTML5 extraction: extract

if it exists and no article was found

Test pages:
https://miranda-ng.org/
https://www.brandenburg.de/
http://intertwingly.net/blog/

Frankly, this is one of the best features of Liferea to date, mainly because novice users don't need to handle scraping for pages that use <article/> tags.

Answer 1 · 2022-05-19T11:58:08.000Z

You haven't actually reported a bug or requested a feature. I can guess what the point is, but please modify the text of your issue to include a feature request or a bug report. Thanks!

Answer 2 · 2022-05-19T13:04:00.000Z

Title corrected accordingly

Answer 3 · 2022-05-19T13:35:21.000Z

Thanks! So the request is: support extracting feed items directly from HTML data?

Answer 4 · 2022-05-19T15:44:40.000Z

On Thu, 19 May 2022 06:35:32 -0700, Kurt McKee wrote:
> Thanks! So the request is: support extracting feed items directly from HTML data?

Yes, but only in certain cases, just like Liferea. Of course, this leaves us with limited options, because we are guessing what counts as an <article> entry. Apparently, some websites that don't provide feeds are still useful when treated as feeds, so I think a very specific guessing mechanism is worth having.

Answer 5 · 2022-07-14T13:21:06.000Z

I think this is unacceptable.
Software should do one thing and do it well.

I want to close this issue (or change it).
If you want, I can provide, for the feedparser documentation, a companion script that scrapes <article> elements, guesses the Title and Summary using lxml (XPath), and outputs an Atom feed using feedparser.
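A minimal sketch of what such a companion script could look like, under stated assumptions. It swaps lxml and XPath for the standard library's html.parser so that it runs without extra dependencies, and it builds the Atom output with xml.etree.ElementTree, because feedparser parses feeds rather than writing them. The names ArticleExtractor, extract_articles and to_atom are hypothetical, and the heuristics (first heading as title, first link, first paragraph as summary) are guesses for illustration, not Liferea's or feedparser's behaviour. The output also omits elements, such as author, that a strictly valid Atom feed requires.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ArticleExtractor(HTMLParser):
    """Guess a title, link and summary for each top-level <article>."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.entries = []      # finished entries
        self.current = None    # entry being built, None outside <article>
        self.depth = 0         # nesting depth of <article> elements
        self.field = None      # "title" or "summary" while capturing text
        self.field_tag = None  # tag that opened the current capture

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
            if self.depth == 1:
                self.current = {"title": "", "link": "", "summary": ""}
            return
        if self.current is None:
            return
        # Assumption: the first link inside the article points to it.
        if tag == "a" and not self.current["link"]:
            href = dict(attrs).get("href")
            if href:
                self.current["link"] = urljoin(self.base_url, href)
        # Assumption: first heading is the title, first <p> the summary.
        if self.field is None:
            if tag in HEADINGS and not self.current["title"]:
                self.field, self.field_tag = "title", tag
            elif tag == "p" and not self.current["summary"]:
                self.field, self.field_tag = "summary", tag

    def handle_data(self, data):
        if self.current is not None and self.field is not None:
            self.current[self.field] += data

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if tag == self.field_tag:
            self.field = self.field_tag = None
        elif tag == "article":
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace before storing the entry.
                cleaned = {k: " ".join(v.split())
                           for k, v in self.current.items()}
                self.entries.append(cleaned)
                self.current = None
                self.field = self.field_tag = None


def extract_articles(html, base_url):
    """Return a list of dicts with guessed title, link and summary."""
    parser = ArticleExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.entries


def to_atom(entries, feed_title, feed_url):
    """Serialize the guessed entries as a bare-bones Atom document."""
    ET.register_namespace("", ATOM_NS)
    now = datetime.now(timezone.utc).isoformat()

    def atom(name):
        return f"{{{ATOM_NS}}}{name}"

    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("title")).text = feed_title
    ET.SubElement(feed, atom("id")).text = feed_url
    ET.SubElement(feed, atom("updated")).text = now
    for item in entries:
        entry = ET.SubElement(feed, atom("entry"))
        ET.SubElement(entry, atom("title")).text = item["title"]
        ET.SubElement(entry, atom("id")).text = item["link"] or feed_url
        ET.SubElement(entry, atom("updated")).text = now
        if item["link"]:
            ET.SubElement(entry, atom("link"), href=item["link"])
        ET.SubElement(entry, atom("summary")).text = item["summary"]
    return ET.tostring(feed, encoding="unicode")
```

The resulting string could then be handed to feedparser.parse() like any other feed, which keeps the guessing logic outside feedparser itself.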

If someone has a problem with websites not providing web feeds (probably because they are unaware of this technology), contact the web admins. It's a better solution.

What do you think, @kurtmckee?

Answer 6 · 2022-07-14T13:39:07.000Z

I would still like to see this in feedparser in the future, using the h-feed spec as a guide. For now, I'm fine with closing this issue.
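For context, h-feed is a microformats2 vocabulary: entries are marked with the class h-entry, and properties such as the name, URL and summary are marked with the classes p-name, u-url and p-summary. The sketch below is only an illustration of how explicit markup of that kind could be read with the standard library. The name HEntryExtractor is hypothetical, and the code ignores implied properties, nested microformats and the rest of the microformats2 parsing rules, so it is not a conforming parser and not a proposal for feedparser's API.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# microformats2 text properties read by this sketch, mapped to dict keys.
TEXT_PROPS = {"p-name": "name", "p-summary": "summary"}


class HEntryExtractor(HTMLParser):
    """Collect explicit p-name, u-url and p-summary values per h-entry."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.entries = []    # finished entries
        self.current = None  # entry being built, None outside an h-entry
        self.stack = []      # (tag, role) for every open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        role = None
        if self.current is None:
            if "h-entry" in classes:
                self.current = {"name": "", "url": "", "summary": ""}
                role = "entry"
        else:
            if ("u-url" in classes and attrs.get("href")
                    and not self.current["url"]):
                self.current["url"] = attrs["href"]
            # Only the first occurrence of each text property is captured.
            for cls, prop in TEXT_PROPS.items():
                if cls in classes and not self.current[prop]:
                    role = prop
                    break
        self.stack.append((tag, role))

    def handle_data(self, data):
        if self.current is None:
            return
        open_props = {role for _, role in self.stack
                      if role in ("name", "summary")}
        for prop in open_props:
            self.current[prop] += data

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, together
        # with anything left unclosed inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                closed = self.stack[i:]
                del self.stack[i:]
                if any(role == "entry" for _, role in closed):
                    self._finish_entry()
                return

    def _finish_entry(self):
        cleaned = {k: " ".join(v.split()) for k, v in self.current.items()}
        self.entries.append(cleaned)
        self.current = None


def extract_h_entries(html):
    """Return one dict (name, url, summary) per h-entry found in html."""
    parser = HEntryExtractor()
    parser.feed(html)
    parser.close()
    return parser.entries
```

Unlike guessing from bare <article> elements, this relies on the page author having marked up the entries explicitly, so it only helps on sites that publish h-feed markup.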

Answer 7 · 2022-07-14T13:47:02.000Z

I didn't know there was a specification covering <article>.
The h-feed spec is definitely worth adding.
I just don't think that something so specific, which can moreover be done by a relatively simple external script, is a sensible addition to feedparser.

Thank you for sharing h-feed!