Simple aggregator for RSS feeds, for my own pleasure.
In my example we use the Microsoft tech blogs:
https://techcommunity.microsoft.com/t5/security-compliance-and-identity/ct-p/MicrosoftSecurityandCompliance
hitem
- Create a new GitHub repository and upload the files:
  Create a new public GitHub repository (e.g., "rss-aggregator"). You'll store the Python script, the aggregated RSS feed, and the workflow file in this repository.
  Personally, I ran the Python script locally to generate the aggregated_feed.xml file. The workflow can do that for you as well, but you may not want to wait for its first run. To run the script locally, first install the dependencies:
  pip install feedparser lxml beautifulsoup4
  then run:
  python3 rss_aggregator.py
  (A minimal sketch of what such a script does is shown after this list.)
- Set up GitHub Pages:
  Go to the repository settings, scroll down to the GitHub Pages section, and choose the "main" branch as the source. Save the changes, and you'll get a URL for your GitHub Pages site (e.g., https://<username>.github.io/rss-aggregator/).
- Update the 'link' field in the script:
  Replace the 'link' field in the rss_aggregator.py script with your GitHub Pages URL:
  etree.SubElement(channel, "link").text = "https://<username>.github.io/<repo name>/aggregated_feed.xml"
- Update the workflow:
  Adjust the timer in rss_aggregator.yml accordingly. The cron schedule is the main setting (how often the workflow runs). One more setting to know about: under the step named "Update processed links file", links are only stored for 365 days, to prevent processed_links.txt from growing too big.
- Then take the link to aggregated_feed.xml and paste it into your RSS hook (and set the digest frequency to match your cron configuration in rss_aggregator.yml).
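
For orientation, here is a minimal sketch of what an rss_aggregator.py along these lines can look like. This is an assumption pieced together from the steps above, not the exact script in this repository: the 2-hour window, the 'link' field, and the aggregated_feed.xml filename come from the text, while the feed list and channel metadata are placeholders. (The real script also depends on beautifulsoup4, presumably for cleaning entry HTML; this sketch omits that.)

```python
import datetime

import feedparser
from lxml import etree

# Placeholder: replace with the RSS feed URLs you want to aggregate.
FEED_URLS = [
    "https://example.com/blog/feed.xml",
]

# Only keep entries published within the processing window
# (2 hours here, matching the cron notes further down).
time_threshold = datetime.datetime.utcnow() - datetime.timedelta(hours=2)

# Build the skeleton of the aggregated RSS 2.0 feed.
rss = etree.Element("rss", version="2.0")
channel = etree.SubElement(rss, "channel")
etree.SubElement(channel, "title").text = "Aggregated feed"
# Point 'link' at your GitHub Pages URL, as described in the steps above.
etree.SubElement(channel, "link").text = (
    "https://<username>.github.io/<repo name>/aggregated_feed.xml"
)
etree.SubElement(channel, "description").text = "Simple RSS aggregator"

for url in FEED_URLS:
    parsed = feedparser.parse(url)
    for entry in parsed.entries:
        # feedparser exposes parsed timestamps as time.struct_time (UTC).
        published = entry.get("published_parsed") or entry.get("updated_parsed")
        if published is None:
            continue
        if datetime.datetime(*published[:6]) < time_threshold:
            continue  # older than the processing window, skip
        item = etree.SubElement(channel, "item")
        etree.SubElement(item, "title").text = entry.get("title", "")
        etree.SubElement(item, "link").text = entry.get("link", "")
        etree.SubElement(item, "pubDate").text = entry.get("published", "")

# Write the file that GitHub Pages will serve.
etree.ElementTree(rss).write(
    "aggregated_feed.xml", pretty_print=True, xml_declaration=True, encoding="UTF-8"
)
```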
Example from Teams:
- Script Execution: The cron job triggers the script every hour. The script fetches and processes RSS feed entries from the last 2 hours.
- Link Processing: The script writes new links to processed_links.txt and skips links that are already present in processed_links.txt (a sketch of this step follows the list).
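
A minimal sketch of that link-processing step, assuming processed_links.txt stores one link per line (the filename comes from the text above; the function name is illustrative):

```python
def filter_new_links(links, processed_file="processed_links.txt"):
    """Return only links not yet in processed_file, and record them."""
    try:
        with open(processed_file, "r", encoding="utf-8") as f:
            seen = {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        seen = set()  # first run: nothing processed yet

    new_links = [link for link in links if link not in seen]

    # Append the new links so the next run skips them.
    with open(processed_file, "a", encoding="utf-8") as f:
        for link in new_links:
            f.write(link + "\n")

    return new_links
```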
If you wish to change the time window for link collection, make sure your cron job interval does not exceed time_threshold, or links published between runs will be missed. For example:
- Time Threshold Setting: time_threshold = datetime.datetime.utcnow() - datetime.timedelta(hours=2)
- Cron Job Interval: '0 */1 * * *'
Important: The cron job interval should never exceed the time_threshold window. Overlap between runs is fine, since processed_links.txt filters out links that were already handled. If you want to collect links for a longer period (e.g., the last 30 days), ensure your cron job interval accommodates this timeframe. The Digest timer should also match your cron job interval to avoid duplicate entries.
To maintain comprehensive and timely updates, align the Digest interval with the cron job interval or a multiple of the processing window to ensure no links are missed. Here are specific recommendations:
- Align with the Cron Job: Keep the Digest interval at 1 hour to match the cron job and processing window. This alignment ensures no links are missed and duplicates are avoided.
- Match Processing Window: If a less frequent Digest is preferred, set it to 2 hours, matching the script's processing window. This setting ensures comprehensive coverage without duplication.
- Extended Time Window: If you wish to collect and process links over a longer period, such as 30 days, set the script's time threshold to 30 days and adjust the Digest interval to 30 days. Ensure the cron job interval supports this setup to avoid excessive overlap or missed links; a sketch of this variant follows below.
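
For that extended variant, the only code change is the threshold line (assuming the time_threshold line shown earlier):

```python
import datetime

# Collect entries from the last 30 days instead of the last 2 hours.
time_threshold = datetime.datetime.utcnow() - datetime.timedelta(days=30)
```

The cron schedule in rss_aggregator.yml would then be relaxed to match, e.g. a monthly '0 0 1 * *', with the Digest interval set to 30 days as described above.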
By following these guidelines, your RSS feed aggregator will run efficiently without missing or duplicating links.