The core Media Cloud server wasn't performing well, so we made this quick and dirty backup project. It gets a pre-filled list of the RSS feeds MC usually scrapes each day (~130k). Then, throughout the day, it tries to fetch those feeds. Every night it generates a synthetic RSS feed with all the story URLs found that day.
Files are available afterwards at http://my.server/rss/mc-YYYY-MM-dd.rss.gz
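Downstream consumers can fetch and parse one of these daily files with a few lines of Python. The sketch below is illustrative only: it assumes a standard RSS 2.0 layout (a <channel> containing <item>/<link> elements) and uses the placeholder host from the URL pattern above.

    import gzip
    import urllib.request
    import xml.etree.ElementTree as ET

    day = "2023-06-01"  # hypothetical example date; one file is published per day
    url = f"http://my.server/rss/mc-{day}.rss.gz"  # placeholder host from above

    # Download and decompress the gzipped daily feed.
    with urllib.request.urlopen(url) as resp:
        xml_bytes = gzip.decompress(resp.read())

    # Assumes plain RSS 2.0: one <item> per story, each with a <link>.
    root = ET.fromstring(xml_bytes)
    links = [item.findtext("link") for item in root.iter("item")]
    print(f"{len(links)} story URLs for {day}")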
See documentation in doc/ for more details.
For development using dokku, see doc/deployment.md
For development directly on your local machine:
- Install postgresql & redis
- Create a virtual environment:
python -m venv venv
- Activate the venv:
source venv/bin/activate
- Install prerequisite packages:
pip install -r requirements.txt
- Create a postgres user:
sudo -u postgres createuser -s MYUSERNAME
- Create a database called "rss-fetcher" in Postgres:
createdb rss-fetcher
- Initialize the database:
alembic upgrade head
- Copy the environment template (little or no editing should be needed):
cp .env.template .env
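As a quick sanity check that the database exists and the migrations were applied, something like the following can be run from inside the venv. This is a minimal sketch: it assumes psycopg2 is available via requirements.txt and local peer authentication for the Postgres user created above; alembic_version is the table where alembic records the applied revision.

    import psycopg2

    # Connect to the local "rss-fetcher" database created above.
    conn = psycopg2.connect(dbname="rss-fetcher")
    with conn, conn.cursor() as cur:
        # Alembic stores the currently applied migration id here.
        cur.execute("SELECT version_num FROM alembic_version")
        print("current alembic revision:", cur.fetchone()[0])
    conn.close()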
The shell script autopep8.sh runs autopep8 on all .py files, and mypy.sh runs type checking. Both should be run before merging to main (or submitting a pull request).
Various scripts run each separate component:
python -m scripts.import_feeds my-feeds.csv
: Use this to import from a CSV dump of feeds (a one-time operation).
run-fetch-rss-feeds.sh --loop 5
: Runs continuously, adding ready feeds to the work queue.
run-rss-workers.sh
: Start a single worker process servicing the work queue.
run-gen-daily-story-rss.sh
: Generate the daily files of URLs found on each day (run nightly).
python -m scripts.db_archive
: Archive and trim the fetch_events and stories tables (run nightly).
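For orientation, the nightly generation step conceptually produces a gzipped RSS 2.0 file named mc-YYYY-MM-dd.rss.gz with one <item> per story URL. The sketch below is an illustration only, not the actual run-gen-daily-story-rss.sh code; the channel title, item fields, and output directory are assumptions.

    import gzip
    import xml.etree.ElementTree as ET
    from datetime import date

    def write_daily_rss(urls, out_dir="."):
        # Build a bare-bones RSS 2.0 document with one <item> per story URL.
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = "Media Cloud backup feed"  # assumed title
        for u in urls:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "link").text = u
        # File name follows the published pattern mc-YYYY-MM-dd.rss.gz.
        path = f"{out_dir}/mc-{date.today().isoformat()}.rss.gz"
        with gzip.open(path, "wb") as f:
            f.write(ET.tostring(rss, encoding="utf-8", xml_declaration=True))
        return path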
- doc/database-changes.md describes how to implement database migrations.
- doc/stats.md describes how monitoring is implemented.
See doc/deployment.md and dokku-scripts/README.md for procedures and scripts.