ingestion: A Python repository from Digital Public Library of America Attic

The DPLA Ingestion System

Build Status

Documentation

Please see the release notes regarding changes and upgrade steps.

Setting up the ingestion server:

Install Python 2.7 if not already installed (http://www.python.org/download/releases/2.7/);

Install PIP (http://pip.readthedocs.org/en/latest/installing.html);

Install the ingestion subsystem;

$ pip install --no-deps --ignore-installed -r requirements.txt

Configure an akara.ini file appropriately for your environment;

[Akara]
Port=<port for Akara to run on>
; Recommended LogLevel is one of DEBUG or INFO
LogLevel=<priority>

[CouchDb]
Url=<URL to CouchDB instance, including trailing forward-slash>
Username=<CouchDB username>
Password=<CouchDB password>
SyncQAViews=<True or False; consider False on production>
; Recommended LogLevel is INFO for production; defaults to INFO if not set
LogLevel=<priority>

[Twofishes]
BaseUrl=<URL to Twofishes server, required for geo-enrichments>

[Rackspace]
Username=<Rackspace username>
ApiKey=<Rackspace API key>
DPLAContainer=<Rackspace container for bulk download data>
SitemapContainer=<Rackspace container for sitemap files>

[APITokens]
NYPL=<Your NYPL API token>

[Sitemap]
SitemapURI=<Sitemap URI>
SitemapPath=<Path to local directory for sitemap files>

[Alert]
To=<Comma-separated email addresses to receive alert email>
From=<Email address to send alert email>

[Enrichment]
QueueSize=4
ThreadCount=4

Merge the akara.conf.template and akara.ini file to create the akara.conf file;

$ python setup.py install

Set up and start the Akara server;

$ akara -f akara.conf setup
$ akara -f akara.conf start

Build the database views;

$ python scripts/sync_couch_views.py dpla
$ python scripts/sync_couch_views.py dashboard
$ python scripts/sync_couch_views.py bulk_download

Testing the ingestion server:

You can test it with this set description from Clemson;

$ curl "http://localhost:8889/oai.listrecords.json?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe&oaiset=jfb&limit=10"

If you have the endpoint URL but not a set id, there's a separate service for listing the sets;

$ curl "http://localhost:8889/oai.listsets.json?endpoint=http://repository.clemson.edu/cgi-bin/oai.exe&limit=10"

To run the ingest process run the setup.py script, if not done so already, initialize the database and database views, then feed it a source profile (found in the profiles directory);

$ python setup.py install
$ python scripts/sync_couch_views.py dpla
$ python scripts/sync_couch_views.py dashboard
$ python scripts/ingest_provider.py profiles/clemson.pjs

License

This application is released under a AGPLv3 license.

dpla-attic/ingestion

The DPLA Ingestion System

Build Status

Documentation

License