Register Ingester PSC is an application designed for use with the People with Significant Control (PSC) data published by Companies House in the UK.

Primary language: Ruby. License: Apache License 2.0 (Apache-2.0).

Register Ingester PSC

Register Ingester PSC is a data ingester for the OpenOwnership Register project. It processes the bulk People with Significant Control (PSC) data published by Companies House in the UK and ingests the records into Elasticsearch. Optionally, it can also publish new records to AWS Kinesis. It works with raw records only and does not convert them into the Beneficial Ownership Data Standard (BODS) format.

Installation

Install and boot Register.

Configure your environment using the example file:

cp .env.example .env

The following variables are optional:

  • PSC_STREAM: AWS Kinesis stream to which new records are published (optional)
  • PSC_STREAM_API_KEY: PSC Stream API registration key (optional; only needed when ingesting via a stream rather than snapshots)

Create the Elasticsearch indexes:

docker compose run ingester-psc create-indexes
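
The configuration and index-creation steps above can be combined into one repeatable setup step. This is a minimal sketch rather than anything shipped with the repository: the function name `setup_ingester` is hypothetical, and it assumes it is run from the repository root, where `.env.example` lives.

```shell
# Hypothetical helper: prepare .env and create the Elasticsearch indexes.
setup_ingester() {
  # Copy the example config only if .env is missing, so existing edits are kept.
  if [ ! -f .env ]; then
    cp .env.example .env || return 1
  fi
  # Same command as documented above.
  docker compose run ingester-psc create-indexes
}
```

Running `setup_ingester` a second time leaves an existing `.env` untouched and only re-runs the index creation command.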

Testing

Run the tests:

docker compose run ingester-psc test

Usage

There are three ways to ingest data:

  • ingest via snapshots by using the helper script
  • ingest via snapshots by running the commands step-by-step
  • ingest via a stream by running the commands step-by-step (not fully functional)

Snapshots using the helper script

To ingest the bulk data from a snapshot (published daily):

docker compose run ingester-psc ingest-bulk

Snapshots step-by-step

Decide on an import ID relating to the data to download, e.g. 2023_10_06. The same ID is then passed to each of the subsequent commands.

Discover the available snapshots by retrieving the snapshot list:

docker compose run ingester-psc discover-snapshots 2023_10_06

Ingest the snapshots. This iterates through the files uploaded to the designated prefix for the import ID and ingests them into Elasticsearch:

docker compose run ingester-psc ingest-snapshots 2023_10_06
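
The two step-by-step commands can be chained so that ingestion only starts once discovery has succeeded. The following is a minimal sketch, not part of the repository: the function names `import_id_from_date` and `ingest_snapshots_for` are hypothetical, and it assumes the import ID is simply the snapshot date written with underscores, as in the commands above.

```shell
# Hypothetical helper: turn an ISO date (2023-10-06) into the
# underscore form used in the commands above (2023_10_06).
import_id_from_date() {
  printf '%s\n' "$1" | tr '-' '_'
}

# Hypothetical wrapper: run discovery, then ingestion, for one import ID.
# The && ensures ingestion is skipped if discovery fails.
ingest_snapshots_for() {
  docker compose run ingester-psc discover-snapshots "$1" &&
    docker compose run ingester-psc ingest-snapshots "$1"
}
```

With these functions loaded, `ingest_snapshots_for "$(import_id_from_date 2023-10-06)"` would run both commands with the import ID 2023_10_06.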

Stream step-by-step (not fully functional)

Connect to the PSC Stream API, consume any new records, and ingest them into Elasticsearch (PSC_STREAM_API_KEY must be set):

docker compose run ingester-psc ingest-stream

Alternatively, to resume from a specific stream position, pass it as STREAM_POSITION (the position must be valid and not too old):

docker compose run ingester-psc ingest-stream <STREAM_POSITION>
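
Because stream ingestion depends on PSC_STREAM_API_KEY, it can help to check for the key before starting. This is a minimal sketch under stated assumptions: the names `has_stream_api_key`, `ingest_stream` and `ENV_FILE` are hypothetical, and it assumes `.env` holds plain `KEY=value` lines without quoting.

```shell
# Hypothetical check: succeeds if the env file gives PSC_STREAM_API_KEY
# a non-empty value (assumes plain KEY=value lines).
has_stream_api_key() {
  grep -Eq '^PSC_STREAM_API_KEY=.+' "${1:-.env}"
}

# Hypothetical wrapper: refuse to start without a key, otherwise run the
# documented command, forwarding an optional STREAM_POSITION argument.
ingest_stream() {
  if ! has_stream_api_key "${ENV_FILE:-.env}"; then
    echo "PSC_STREAM_API_KEY is not set in ${ENV_FILE:-.env}" >&2
    return 1
  fi
  docker compose run ingester-psc ingest-stream "$@"
}
```

Calling `ingest_stream` with no argument consumes new records from the current position, while `ingest_stream <STREAM_POSITION>` passes the position through unchanged.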