Coordinates ingest of a full ADS record. The workflow is as follows.
- Read bibcodes from file
- Query mongo for changed records. A changed record is one whose
record.JSON_fingerprint
is different than that parsed from disk - Call ads.ADSExports.ADSRecords to consolidate each record's data (ADS-classic)
- Parses resulting xmlobject to a native python structure via
xmltodict.py
- Enforces a set schema that is documented in
schema.json
- Merges any repeated metadata blocks
- Insert/update record in mongodb. Insert if it is new, update if that bibcode or one of its alternate bibcodes exists.
The workflow is initiated by invoking run.py
.
- Invoking
run.py --async
publishes the[(bibcode, fingerprint),...]
records to rabbitmq. - Workers that consume these messages are defined in
pipeline/psettings.py
andpipeline/workers.py
. - Workers are controlled via a master process in
pipeline/ADSimportpipeliny.py
. - Workers perform their tasks independently and (optionally) concurrently.
- Workers expect and return JSON.
- pika
- rabbitmq
- ADSExports
- pymongo + mongo
- Note: The rabbitmq server should be configured for frame_max=512000
- Note: pika should be configured with frame_max=512000 (seemingly must be changed in spec.py in addition to normal connection definition)