Utility suite optimized for fast-feed finding of embedded RSS and iCal feeds. Its job is not to find all feeds on a page, but only to find the first. It uses a smart cache handler and attempts to make the fewest outgoing requests possible. This makes it exceptionally fast, and good for web applications.
FindFeed executes various strategies to find a feed. After executing a strategy, FindFeed validates the suspected feeds and returns the first valid one. Despite the fact that the strategies differ depending on whether one is looking for RSS or iCal feeds, strategies are somewhat broadly similar:
- Embedded strategies search for <iframe> tags and parse them accordingly. The Embedded ICS strategy currently only finds iframe-embedded google calendars as these are the most common.
- LinkRel strategies search for tags
- Hyperlink strategies search for hyperlinks (the tag) and inspect the href attributes and text contents for likely matches.
- ChildPageStrategies, technically a kind of Hyperlink strategy, searches hyperlinks on a given page looking for links to other pages (e.g. "Blog page", "news page", "calendar page", "events page", etc.) On these other pages, ChildPageStrategies execute some further number of strategies such as an embedded stratgegy, a linkrel strategy, or a hyperlink strategy. ChildPageStrategies are not recursive.
- Default strategies search for the presence of a given URL, e.g. "/feed". (Given that 30%+ of all websites have RSS feeds at this location due to running Wordpress, this is a surprisingly simple, fast and effective mechanism).
- Validator strategies simply use a validator to check whether the suspected URL is, in fact, a feed.
After a strategy has found a number of suspected feeds, FeedFinder attempts to validate these feeds and return the first one. Two validators are included:
- An ICS validator, which uses
ics.Calendar
- An RSS validator, which uses
feedparser
.
You must have both of these libraries, along with urlib
and requests
to use this utility.
Along with ics.Calendar and feedparser, BeautifulSoup (bs4) is also required. A requirements.txt file is provided for you to pass to pip.
CLI:
python findfeed.py https://example.com
By default, CLI will fetch both kinds of feed. To fetch only one, try --ics
or --rss
.
FeedFinder: from .FeedFinder import FeedFinder feeds = FeedFinder(self.website) if not self.news_feed: self.news_feed = feeds.get_rss_feed() if not self.ical_feed: self.ical_feed = feeds.get_ics_feed()
You can also use the feed validators separately by calling their validate_single
argument with the URL to check, for example:
from .validators import RssValidator
validator = RssValidator()
return HttpResponse() if validator.validate_single(request.POST.get("feed_url")) else HttpResponse(status=504)
Like the FeedFinder object itself, Validators cache the results of their validation.
- A flag should control whether it returns all feeds or just some.
- A CLI utility suite would be very nice.