kurtmckee/feedparser

[Feature Request] Parsing HTML5 Pages

sjehuda opened this issue · 7 comments

It appears that the only feed reader to handle <article/> tags is Liferea, by Mr. Lars Windolf @lwindolf.

Introduction:
Subscribing To HTML5 Websites That Have No Feed

First commit:
Add support for subscribing to HTML5 websites without RSS/Atom feeds by extracting article titles, links and descriptions

Last commit to date:
Improve HTML5 extraction: extract

if it exists and no article was found

Test pages:
https://miranda-ng.org/
https://www.brandenburg.de/
http://intertwingly.net/blog/

Frankly, this is one of the best features of Liferea to date, mainly because novice users don't need to handle scraping for pages that use <article/> tags.

You haven't actually reported a bug or requested a feature. I can guess what the point is, but please modify the text of your issue to include a feature request or a bug report. Thanks!

Title corrected accordingly

Thanks! So the request is: support extracting feed items directly from HTML data?

I think this is unacceptable.
Software should do one thing and do it well.

I want to close this issue (or change it).
If you want, I can provide, for the feedparser documentation, a complementary script that scrapes <article/> elements using lxml (XPath), guesses a Title and Summary for each, and outputs an Atom feed using feedparser.
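For illustration, here is a minimal sketch of what such a companion script might look like. It is not the script offered above: instead of lxml and XPath it uses only the Python standard library (html.parser and xml.etree.ElementTree). feedparser parses feeds rather than writing them, so the Atom output is built by hand. The names ArticleExtractor and to_atom, the heuristics (first heading is the title, first paragraph the summary, first link the entry link) and the example.org URL are all assumptions made for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


class ArticleExtractor(HTMLParser):
    """Hypothetical helper: guess title, link and summary per <article>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []      # finished articles
        self.current = None  # article being parsed, or None
        self.field = None    # "title" or "summary" while collecting text

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.current = {"title": "", "link": "", "summary": ""}
        elif self.current is None:
            return
        elif tag in ("h1", "h2", "h3") and not self.current["title"]:
            self.field = "title"      # first heading becomes the title
        elif tag == "p" and not self.current["summary"]:
            self.field = "summary"    # first paragraph becomes the summary
        elif tag == "a" and not self.current["link"]:
            href = dict(attrs).get("href")
            if href:                  # first link becomes the entry link
                self.current["link"] = urljoin(self.base_url, href)

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] += data

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self.field = None
        elif tag == "article" and self.current is not None:
            self.items.append({k: v.strip() for k, v in self.current.items()})
            self.current = None
            self.field = None


def to_atom(items, feed_title, feed_url):
    """Serialize the guessed items as a bare-bones Atom document.

    Assumption: <updated> is omitted because the page carries no dates,
    so the result is not strictly valid Atom.
    """
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = feed_title
    ET.SubElement(feed, f"{{{ATOM_NS}}}id").text = feed_url
    for item in items:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = item["title"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = item["link"]
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=item["link"])
        ET.SubElement(entry, f"{{{ATOM_NS}}}summary").text = item["summary"]
    return ET.tostring(feed, encoding="unicode")


if __name__ == "__main__":
    page = ('<article><h2><a href="/post/1">First post</a></h2>'
            "<p>Hello world.</p></article>")
    extractor = ArticleExtractor("https://example.org/")
    extractor.feed(page)
    print(to_atom(extractor.items, "Example", "https://example.org/"))
```

The resulting string could then be handed to feedparser.parse() like any other feed, which keeps the scraping step outside the library.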

If someone has a problem with websites not providing web feeds (probably because they are unaware of this technology), contact the web admins. It's a better solution.

What do you think, @kurtmckee?

I would still like to see this in feedparser in the future, using the h-feed spec as a guide. For now, I'm fine with closing this issue.
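As a rough illustration of what using h-feed as a guide could mean: in microformats2, an h-feed contains h-entry elements whose properties are marked with class names such as p-name, u-url and p-summary. The sketch below collects only those three properties using the Python standard library. The class name HEntryExtractor is hypothetical, the code assumes well-nested markup with explicit closing tags, and it ignores most of the real microformats2 parsing rules (implied properties, nested microformats, relative URLs), so it is not how feedparser does or would implement this.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped when tracking depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HEntryExtractor(HTMLParser):
    """Hypothetical helper: collect name, url and summary per h-entry."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self.entry = None    # h-entry being parsed, or None
        self.depth = 0       # nesting depth inside the current h-entry
        self.capture = None  # (property, depth at which it was opened)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.entry is None:
            if "h-entry" in classes:
                self.entry = {"name": "", "url": "", "summary": ""}
                self.depth = 1
            return
        self.depth += 1
        if "u-url" in classes and attrs.get("href") and not self.entry["url"]:
            self.entry["url"] = attrs["href"]
        if self.capture is None:
            for prop in ("name", "summary"):
                if f"p-{prop}" in classes and not self.entry[prop]:
                    self.capture = (prop, self.depth)
                    break

    def handle_data(self, data):
        if self.entry is not None and self.capture:
            self.entry[self.capture[0]] += data

    def handle_endtag(self, tag):
        if self.entry is None or tag in VOID:
            return
        if self.capture and self.capture[1] == self.depth:
            self.capture = None  # the property element just closed
        self.depth -= 1
        if self.depth == 0:      # the h-entry element itself closed
            self.entries.append({k: v.strip() for k, v in self.entry.items()})
            self.entry = None
```

Because the properties are declared explicitly in the markup, this approach needs no guessing about which heading or paragraph to use, which is the main difference from scraping bare <article/> elements.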

I didn't know there was a spec covering <article/>.
The h-feed spec is definitely worth adding.
I just don't think that something so specific, which can moreover be done by a relatively simple external script, is a sensible addition to feedparser.

Thank you for sharing h-feed!