ADSimportpipeline

Overview

Coordinates ingest of a full ADS record. The workflow is as follows.

Read bibcodes from file
Query mongo for changed records. A changed record is one whose record.JSON_fingerprint is different than that parsed from disk
Call ads.ADSExports.ADSRecords to consolidate each record's data (ADS-classic)
Parses resulting xmlobject to a native python structure via xmltodict.py
Enforces a set schema that is documented in schema.json
Merges any repeated metadata blocks
Insert/update record in mongodb. Insert if it is new, update if that bibcode or one of its alternate bibcodes exists.

The workflow is initiated by invoking run.py.

Invoking run.py --async publishes the [(bibcode, fingerprint),...] records to rabbitmq.
Workers that consume these messages are defined in pipeline/psettings.py and pipeline/workers.py.
Workers are controlled via a master process in pipeline/ADSimportpipeliny.py.
Workers perform their tasks independently and (optionally) concurrently.
Workers expect and return JSON.

pika
rabbitmq
ADSExports
pymongo + mongo
Note: The rabbitmq server should be configured for frame_max=512000
Note: pika should be configured with frame_max=512000 (seemingly must be changed in spec.py in addition to normal connection definition)