Combination web scraper and Drupal uploader. The content sources listed below are scraped (raw or via RSS), the entries are stored in a local SQLite database, and then uploaded to a Drupal instance via the REST API (part of the Services module).
- Articles: http://www.worldbank.org/en/region/sar/whats-new
- Publications: http://www.worldbank.org/en/region/sar/research/all?majdocty_exact=Publications+%26+Research&qterm=&lang_exact=English
- Articles: http://www.worldbank.org/en/region/eap/whats-new
- Publications: http://www.worldbank.org/en/region/eap/research/all?majdocty_exact=Publications+%26+Research&qterm=&lang_exact=English
- Articles (RSS): http://feeds.feedburner.com/adb_news
- Publications (RSS): http://feeds.feedburner.com/adb_publications
- Articles: http://www.asean.org/news
- Articles: http://www.unescap.org/media-centre/feature-stories
- Events: http://www.unescap.org/events/upcoming
- Publications: http://www.unescap.org/publications
- Articles: http://www.cacaari.org/en.php?/news
- Events (RSS): http://www.apaari.org/events/feed
- Articles: http://www.ucentralasia.org/news.asp

Drupal setup requires the `services` and `libraries` modules (upgrade `libraries` to >= 2.2):

- Enable the `services` module: `drush pm-download services && drush pm-enable services`
- Enable the `REST Server` module: `drush pm-enable rest_server`
- Clear the Drupal cache: `drush cc all`
- Add a service endpoint (`/admin/structure/services/add`):
  - Name: `api`
  - Server: `REST`
  - Path: `api`
  - Session authentication: checked
- Edit the endpoint resources (`/admin/structure/services/list/api/resources`):
  - Enable the `node/create` resource
  - Enable the `user/login` resource
- Edit the endpoint REST parameters (`/admin/structure/services/list/api/server`):
  - Response formatters: `json` only
  - Request parsing: `application/json` only
- Create a user `feed` with the `developer` role
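
Once the endpoint is configured, the upload flow amounts to a session login followed by authenticated node creation. Below is a minimal `curl` sketch, assuming a site at `https://example.org` with the `api` endpoint above; the password, content type, and node fields are placeholders, and recent Services releases also require the CSRF token returned by the login call:

```sh
# Log in as the "feed" user; the JSON response carries session_name,
# sessid, and (on newer Services releases) a CSRF token
curl -s -X POST https://example.org/api/user/login \
  -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -d '{"username": "feed", "password": "PASSWORD"}'

# Create an unpublished node using the session cookie and token from above
# ("article" and the fields are placeholders for your content type)
curl -s -X POST https://example.org/api/node \
  -H 'Content-Type: application/json' -H 'Accept: application/json' \
  -H 'Cookie: SESSION_NAME=SESSID' \
  -H 'X-CSRF-Token: TOKEN' \
  -d '{"type": "article", "title": "Example item", "status": 0}'
```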

Local requirements:

- Python >= 2.6
- `virtualenv` Python library
- `sqlite3` system library
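
A typical environment setup might look like the sketch below; `requirements.txt` is an assumption, and `run.sh` may already handle some of these steps:

```sh
# Create and activate an isolated Python environment
virtualenv venv
. venv/bin/activate

# Install the scraper's dependencies; the file name is an assumption
pip install -r requirements.txt
```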

Usage:

- Edit `drupal.env.sample` in the source tree to match your instance's parameters and save it as `drupal.env` (see the sketch after this list)
- Execute `run.sh` from the project root
- If the internal scraper database should be cleared, either delete `db/scraper.sqlite` or run the scraper manually the first time: `./run.sh --kill-db`
- For `cron`, run it like this (probably at midnight): `cd <scraper dir> && ./run.sh`
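
As a sketch, `drupal.env` is presumably a shell-style file of `KEY=value` pairs; the variable names below are illustrative, so copy `drupal.env.sample` for the real keys:

```sh
# Hypothetical keys -- see drupal.env.sample for the actual ones
DRUPAL_URL=https://example.org
DRUPAL_ENDPOINT=api
DRUPAL_USER=feed
DRUPAL_PASS=secret
```

A matching midnight crontab entry would be:

```sh
# m h dom mon dow  command
0 0 * * * cd <scraper dir> && ./run.sh
```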

Command-line options:

- `--no-scrape`: skip content scraping
- `--no-post`: skip content upload
- `--post-limit <N>`: only upload the first N items to Drupal
- `--debug`: show debug info
- `--db <db>`: specify the database file (default: `db/scraper.sqlite`)
- `--kill-db`: delete the database before starting
- `--events-only`: only post events to Drupal
- `--pubs-only`: only post publications to Drupal
- `--show-pending`: print the number of pending items
- `--only <scraper>`: only run the specified scraper (see `scrapers.txt`)
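
For example, assuming the flags combine as described (the scraper name here is made up; see `scrapers.txt` for the real ones):

```sh
# Scrape everything but upload nothing, then check the queue size
./run.sh --no-post
./run.sh --show-pending

# Run a single (hypothetical) scraper and upload at most 5 items
./run.sh --only adb_news --post-limit 5
```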

Notes:

- All uploaded items are unpublished by default.
- The date limit for articles is January 1, 2014; for events and publications it is January 1, 2010.
- The APAARI events RSS feed does not include parseable event dates.