[Feature Request] Parsing HTML5 Pages
sjehuda opened this issue · 7 comments
It appears that the only feed reader to handle `<article/>` tags is Liferea by Mr. Lars Windolf @lwindolf.
Introduction:
Subscribing to HTML5 Websites That Have No Feed
First commit:
Add support for subscribing to HTML5 websites without RSS/Atom feeds by extracting article titles, links and descriptions
Last commit to date:
Improve HTML5 extraction: extract
Test page:
https://miranda-ng.org/
https://www.brandenburg.de/
http://intertwingly.net/blog/
Frankly, this is one of the best features of Liferea to date, namely because novice users don't need to handle scraping for pages that use the `<article/>` tag.
You haven't actually reported a bug or requested a feature. I can guess what the point is, but please modify the text of your issue to include a feature request or a bug report. Thanks!
Title corrected accordingly
Thanks! So the request is: support extracting feed items directly from HTML data?
I think this is unacceptable.
Software should do one thing and do it well.
I want to close this issue (or change it).
If you want, I can provide, for the feedparser documentation, a companion script that scrapes and guesses Title and Summary from `<article/>` elements using lxml (XPath) and outputs an Atom feed that feedparser can then parse.
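A rough sketch of what such a companion script could look like. For portability this version uses only the standard library instead of lxml/XPath, and the extraction heuristics (first heading as title, first link as entry URL, remaining text as summary) are assumptions for illustration, not what Liferea actually does:

```python
# Hypothetical sketch: pull title/link/summary out of <article> elements
# and emit a minimal Atom feed that feedparser could then parse.
from html.parser import HTMLParser
from xml.etree import ElementTree as ET


class ArticleExtractor(HTMLParser):
    """Collect (title, link, summary) tuples from <article> elements."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._in_article = False
        self._in_heading = False
        self._title = ""
        self._link = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._in_article = True
            self._title, self._link, self._text = "", "", []
        elif self._in_article and tag in ("h1", "h2", "h3"):
            self._in_heading = True          # heading text becomes the title
        elif self._in_article and tag == "a" and not self._link:
            self._link = dict(attrs).get("href", "")  # first link wins

    def handle_endtag(self, tag):
        if tag == "article" and self._in_article:
            self._in_article = False
            self.entries.append(
                (self._title.strip(), self._link, " ".join(self._text).strip())
            )
        elif tag in ("h1", "h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._title += data
        elif self._in_article:
            self._text.append(data.strip())


def articles_to_atom(html, feed_title="Scraped feed"):
    """Return an Atom feed (as a string) built from <article> elements."""
    parser = ArticleExtractor()
    parser.feed(html)
    ns = "http://www.w3.org/2005/Atom"
    feed = ET.Element("{%s}feed" % ns)
    ET.SubElement(feed, "{%s}title" % ns).text = feed_title
    for title, link, summary in parser.entries:
        entry = ET.SubElement(feed, "{%s}entry" % ns)
        ET.SubElement(entry, "{%s}title" % ns).text = title
        ET.SubElement(entry, "{%s}link" % ns, href=link)
        ET.SubElement(entry, "{%s}summary" % ns).text = summary
    return ET.tostring(feed, encoding="unicode")
```

The returned string can be handed straight to `feedparser.parse()`, which accepts feed data as a string.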
If someone has a problem with websites not providing web feeds (probably because they are unaware of this technology), contact the web admins. It's a better solution.
What do you think, @kurtmckee?
I would still like to see this in feedparser in the future, using the h-feed spec as a guide. For now, I'm fine with closing this issue.
I didn't know there was a spec covering `<article/>` content.
h-feed spec is definitely worth adding.
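For context on the h-feed idea: the microformats2 spec marks up entries with class names such as `h-entry` and `p-name`. As a rough illustration only (this is not feedparser code, and it ignores void elements and nested `p-name` markup), a minimal extractor of entry titles might look like:

```python
# Hypothetical sketch: collect the "p-name" (title) of each microformats2
# "h-entry" element. The class names come from the microformats2 spec;
# the parsing strategy is a simplified assumption.
from html.parser import HTMLParser


class HEntryTitles(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles = []
        self._entry_depth = 0   # tag nesting depth inside an h-entry
        self._in_name = False
        self._buf = ""

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if self._entry_depth:
            self._entry_depth += 1
            if "p-name" in classes:
                self._in_name = True
                self._buf = ""
        elif "h-entry" in classes:
            self._entry_depth = 1

    def handle_endtag(self, tag):
        if self._in_name:
            self.titles.append(self._buf.strip())
            self._in_name = False
        if self._entry_depth:
            self._entry_depth -= 1

    def handle_data(self, data):
        if self._in_name:
            self._buf += data
```

Usage: instantiate, call `.feed(html)`, then read `.titles`. A real h-feed consumer would also handle `u-url`, `p-summary`, and implied properties, which is exactly the kind of scope the spec would guide.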
I just don't think that something so specific, which can moreover be done by a relatively simple external script, is a sensible addition to feedparser.
Thank you for sharing h-feed!