Combination web scraper and Drupal uploader. The content sources listed below are scraped (raw or via RSS), the entries are stored in a local SQLite database, and then uploaded to a Drupal instance via the REST API (part of the Services module).
- Articles: http://www.worldbank.org/en/region/sar/whats-new
- Publications: http://www.worldbank.org/en/region/sar/research/all?majdocty_exact=Publications+%26+Research&qterm=&lang_exact=English
- Articles: http://www.worldbank.org/en/region/eap/whats-new
- Publications: http://www.worldbank.org/en/region/eap/research/all?majdocty_exact=Publications+%26+Research&qterm=&lang_exact=English
- Articles (RSS): http://feeds.feedburner.com/adb_news
- Publications (RSS): http://feeds.feedburner.com/adb_publications
- Articles: http://www.asean.org/news
- Articles: http://www.unescap.org/media-centre/feature-stories
- Events: http://www.unescap.org/events/upcoming
- Publications: http://www.unescap.org/publications
- Articles: http://www.cacaari.org/en.php?/news
- Events (RSS): http://www.apaari.org/events/feed
- Articles: http://www.ucentralasia.org/news.asp

Drupal setup requires the `services` and `libraries` modules (upgrade `libraries` to >= 2.2):

- Enable the `services` module: `drush pm-download services && drush pm-enable services`
- Enable the `REST Server` module: `drush pm-enable rest_server`
- Clear the Drupal cache: `drush cc all`
- Add a service endpoint (`/admin/structure/services/add`):
  - Name: `api`
  - Server: `REST`
  - Path: `api`
  - Session authentication: checked
- Edit the endpoint resources (`/admin/structure/services/list/api/resources`):
  - Enable the `node/create` resource
  - Enable the `user/login` resource
- Edit the endpoint REST parameters (`/admin/structure/services/list/api/server`):
  - Response formatters: `json` only
  - Request parsing: `application/json` only
- Create a user `feed` with the `developer` role
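
Once the endpoint is configured, the upload flow amounts to a session login followed by authenticated node creation. Below is a minimal `curl` sketch, assuming a site at `https://example.org` with the `api` endpoint above; the password, content type, and node fields are placeholders, and recent Services releases also require the CSRF token returned by the login call:

```sh
# Log in as the "feed" user; the JSON response carries session_name,
# sessid, and (on newer Services releases) a CSRF token
curl -s -X POST https://example.org/api/user/login \
  -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -d '{"username": "feed", "password": "PASSWORD"}'

# Create an unpublished node using the session cookie and token from above
# ("article" and the fields are placeholders for your content type)
curl -s -X POST https://example.org/api/node \
  -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Cookie: SESSION_NAME=SESSID' \
  -H 'X-CSRF-Token: TOKEN' \
  -d '{"type": "article", "title": "Example item", "status": 0}'
```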

Local requirements:

- Python >= 2.6
- `virtualenv` Python library
- `sqlite3` system library
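
A typical environment setup might look like the sketch below; `requirements.txt` is an assumption, and `run.sh` may already handle some of these steps:

```sh
# Create and activate an isolated Python environment
virtualenv venv
. venv/bin/activate

# Install the scraper's dependencies; the file name is an assumption
pip install -r requirements.txt
```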

Usage:

- Edit `drupal.env.sample` in the source tree to match your instance's parameters and save it as `drupal.env` (see the sketch after this list)
- Execute `run.sh` from the project root
- If the internal scraper database should be cleared, either delete `db/scraper.sqlite` or run the scraper manually the first time: `./run.sh --kill-db`
- For `cron`, run it like this (probably at midnight): `cd <scraper dir> && ./run.sh`
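
As a sketch, `drupal.env` is presumably a shell-style file of `KEY=value` pairs; the variable names below are illustrative, so copy `drupal.env.sample` for the real keys:

```sh
# Hypothetical keys -- see drupal.env.sample for the actual ones
DRUPAL_URL=https://example.org
DRUPAL_ENDPOINT=api
DRUPAL_USER=feed
DRUPAL_PASS=secret
```

A matching midnight crontab entry would be:

```sh
# m h dom mon dow  command
0 0 * * * cd <scraper dir> && ./run.sh
```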

Command-line options:

- `--no-scrape`: skip content scraping
- `--no-post`: skip content upload
- `--post-limit <N>`: only upload the first N items to Drupal
- `--debug`: show debug info
- `--db <db>`: specify the database file (default: `db/scraper.sqlite`)
- `--kill-db`: delete the database before starting
- `--events-only`: only post events to Drupal
- `--pubs-only`: only post publications to Drupal
- `--show-pending`: print the number of pending items
- `--only <scraper>`: only run the specified scraper (see `scrapers.txt`)
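
For example, assuming the flags combine as described (the scraper name here is made up; see `scrapers.txt` for the real ones):

```sh
# Scrape everything but upload nothing, then check the queue size
./run.sh --no-post
./run.sh --show-pending

# Run a single (hypothetical) scraper and upload at most 5 items
./run.sh --only adb_news --post-limit 5
```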

Notes:

- All uploaded items are unpublished by default.
- The date limit for articles is January 1, 2014; for events and publications it is January 1, 2010.
- The APAARI events RSS feed does not include parseable event dates.