Generate an RSS feed using the Scrapy framework.
Install scrapy-rss-exporter using pip:
pip install scrapy-rss-exporter
or using setuptools:
python setup.py install
The most convenient way to use the exporter is to return RssItem objects from your spiders. This class derives from scrapy.Item, so it will work with other exporters as well.
You will need to set the following keys:
from scrapy_rss_exporter.items import RssItem, Enclosure
rss_item = RssItem()
rss_item['title'] = 'Item title'
rss_item['link'] = 'Item url'
rss_item['guid'] = 'Item ID'
rss_item['description'] = 'Item Description'
rss_item['pub_date'] = None
rss_item['enclosure'] = [Enclosure(url='Image url', type='image/jpeg')]
The pub_date field should contain a date in the RFC 822 format. If you use None, the system will insert the current date in the appropriate format. The enclosure field is optional and should contain a (possibly empty) list of Enclosure objects.
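If you would rather supply the date yourself than pass None, the Python standard library can produce a suitable string. The sketch below is illustrative only: the helper name rfc822_date is hypothetical and not part of scrapy-rss-exporter, and it assumes the exporter accepts a preformatted string in pub_date, as the description above suggests.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def rfc822_date(moment: datetime) -> str:
    """Format a datetime as an RFC 822-style date string (hypothetical helper)."""
    # Treat naive datetimes as UTC; this is an assumption of the sketch.
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    # Normalise to UTC so that usegmt=True is allowed and the zone reads "GMT".
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


published = rfc822_date(datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
print(published)  # Thu, 02 Jan 2020 03:04:05 GMT
```

The result could then be assigned with rss_item['pub_date'] = published.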
To set the exporter up globally, you need to declare it in the FEED_EXPORTERS dictionary in the settings.py file:
FEED_EXPORTERS = {
    'rss': 'scrapy_rss_exporter.exporters.RssItemExporter'
}
You can then use it as a FEED_FORMAT and specify the output file in the FEED_URI:
FEED_FORMAT = 'rss'
FEED_URI = 's3://my-feeds/my-feed.rss'
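Scrapy 2.1 and later deprecate FEED_FORMAT and FEED_URI in favour of a single FEEDS dictionary. A sketch of what the equivalent configuration might look like follows; it assumes the exporter is registered under the 'rss' key in FEED_EXPORTERS, and it has not been verified against this exporter on every Scrapy release.

```python
# settings.py fragment for Scrapy 2.1+, where FEEDS replaces FEED_URI/FEED_FORMAT.
FEEDS = {
    's3://my-feeds/my-feed.rss': {
        'format': 'rss',  # must match the key registered in FEED_EXPORTERS
    },
}
```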
Note: Bear in mind that, if you use a local file as output, scrapy will append to an existing file, resulting in invalid RSS. You should, therefore, make sure to delete any existing output file before running the spider. The s3 storage does not have this problem because scrapy uploads use the S3 PutObject method, which replaces the whole object.
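One way to handle the local-file case is to remove the old feed from a small script before starting the crawl. The following is a minimal sketch; the function name remove_stale_feed is hypothetical and the path you pass is whatever local file your FEED_URI points to.

```python
from pathlib import Path


def remove_stale_feed(path: str) -> bool:
    """Delete a previous feed file if present; return True if one was removed."""
    feed = Path(path)
    if feed.is_file():
        feed.unlink()
        return True
    # Nothing to delete, so the next crawl starts with a fresh file anyway.
    return False
```

Calling remove_stale_feed('my-feed.rss') before running the spider would cover the case described in the note. On Scrapy 2.4 and later, the overwrite option of the FEEDS setting is another way to avoid appending to a local file.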
scrapy does not seem to provide a way to pass configuration options to an exporter. Therefore, if you want to customize the feed title and other metadata, you need to create a subclass and update the FEED_EXPORTERS dictionary with the new class name:
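from scrapy_rss_exporter.exporters import RssItemExporter

class MyRssExporter(RssItemExporter):
    def __init__(self, *args, **kwargs):
        # Set the feed-level metadata before the base class is initialised.
        kwargs['title'] = 'My RSS'
        kwargs['link'] = 'https://www.mywebsite.com'
        kwargs['description'] = 'My RSS Items'
        super(MyRssExporter, self).__init__(*args, **kwargs)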
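The three keyword arguments share their names with the title, link and description elements that RSS 2.0 requires on every channel. As a rough illustration of where such values end up in a feed, the standalone sketch below builds a channel header with the standard library. It is not the exporter's actual implementation, and the function name build_channel is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_channel(title: str, link: str, description: str) -> str:
    """Return a minimal RSS 2.0 document containing only channel metadata."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements in RSS 2.0.
    for tag, text in (("title", title), ("link", link), ("description", description)):
        ET.SubElement(channel, tag).text = text
    return ET.tostring(rss, encoding="unicode")


print(build_channel("My RSS", "https://www.mywebsite.com", "My RSS Items"))
```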
You can, of course, specify a different exporter with different settings for each spider. Just use the custom_settings field to override the global configuration fields:
import scrapy

class MySpider(scrapy.Spider):
    name = "my"
    start_urls = ['https://www.mywebsite.com']
    custom_settings = {
        # Dotted path to the exporter subclass defined in your project.
        'FEED_EXPORTERS': {'rss': 'project.spiders.my_spider.MyRssExporter'},
        'FEED_FORMAT': 'rss',
        'FEED_URI': 's3://my-feeds/my-feed.rss',
    }

    def parse(self, response):
        pass