Register Sources BODS is a shared library for the OpenOwnership Register project. It is designed for use with any Beneficial Ownership Data Standard (BODS) format data source.
The primary purposes of this library are:
- Providing typed objects for the JSON-line data. It makes use of the dry-types and dry-struct gems to specify the different object types allowed in the data returned.
- Persisting the BODS records using Elasticsearch. This functionality includes creating a mapping for indexing the possible fields observed as well as functions for storage and retrieval.
- Publishing BODS statements to a designated Kinesis stream.
This library does not transform other data standards into BODS format. That is the purpose of the Register Transformers.
The data standard is BODS 0.2.
Install and boot Register.
Configure your environment using the example file:
cp .env.example .env
Run the tests:
docker compose run sources-bods test
To ingest the local file xx.jsonl into the raw-xx index, optionally publishing to the xx-dev Kinesis stream:
docker compose run sources-bods ingest-local data/imports/xx.jsonl raw-xx
docker compose run sources-bods ingest-local data/imports/xx.jsonl raw-xx xx-dev
To transform the local file xx.jsonl from the raw-xx index into the bods_v2_xx_dev1 index, optionally publishing to the bods-xx-dev Kinesis stream:
docker compose run sources-bods transform-local data/imports/xx.jsonl raw-xx bods_v2_xx_dev1
docker compose run sources-bods transform-local data/imports/xx.jsonl raw-xx bods_v2_xx_dev1 bods-xx-dev
Optionally, 0 can be appended to the command to disable resolving via Open Corporates. If resolving must be disabled but publishing to a Kinesis stream is not wanted, '' 0 can be used as the final two arguments.
To bulk ingest the raw/xx/ S3 prefix into the raw-xx index, optionally publishing to the xx-dev Kinesis stream:
docker compose run sources-bods ingest-bulk raw/xx/ raw-xx
docker compose run sources-bods ingest-bulk raw/xx/ raw-xx xx-dev
To bulk transform the raw/xx/ S3 prefix from the raw-xx index into the bods_v2_xx_dev1 index, optionally publishing to the bods-xx-dev Kinesis stream:
docker compose run sources-bods transform-bulk raw/xx/ raw-xx bods_v2_xx_dev1
docker compose run sources-bods transform-bulk raw/xx/ raw-xx bods_v2_xx_dev1 bods-xx-dev
Optionally, 0 can be appended to the command to disable resolving via Open Corporates. If resolving must be disabled but publishing to a Kinesis stream is not wanted, '' 0 can be used as the final two arguments.
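Because the optional trailing arguments of the transform commands are positional, it can help to see how they combine. The following is a minimal sketch, not part of this repository: transform_cmd is a hypothetical helper that only prints the command line it would run, so the argument order can be checked before anything is executed.

```sh
# Hypothetical helper: prints a transform command with its optional trailing arguments.
# Usage: transform_cmd SUBCOMMAND INPUT RAW_INDEX BODS_INDEX [STREAM] [RESOLVE]
#   SUBCOMMAND is transform-local or transform-bulk.
#   STREAM     is the Kinesis stream name; empty means no publishing.
#   RESOLVE    set to 0 disables resolving via Open Corporates.
transform_cmd() {
  subcommand=$1
  input=$2
  raw_index=$3
  bods_index=$4
  stream=${5:-}
  resolve=${6:-}
  cmd="docker compose run sources-bods $subcommand $input $raw_index $bods_index"
  if [ -n "$resolve" ]; then
    # The resolve flag is positional, so the stream slot must be filled,
    # with '' when nothing should be published.
    if [ -n "$stream" ]; then
      cmd="$cmd $stream $resolve"
    else
      cmd="$cmd '' $resolve"
    fi
  elif [ -n "$stream" ]; then
    cmd="$cmd $stream"
  fi
  printf '%s\n' "$cmd"
}

# Resolving disabled, no publishing:
transform_cmd transform-local data/imports/xx.jsonl raw-xx bods_v2_xx_dev1 '' 0
# Publishing enabled, resolving left at its default:
transform_cmd transform-bulk raw/xx/ raw-xx bods_v2_xx_dev1 bods-xx-dev
```

The helper prints rather than runs the command on purpose, so the output can be reviewed and then pasted into the shell.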
In order to perform the monthly bulk data tasks, it is necessary to import the latest raw data, process the raw data to turn it into BODS statements, and export the BODS statements to compressed files available for download internally and from the Register website. These tasks span multiple repositories and commands.
All of these commands should be run on the Register server in EC2 (bods-register).
The Ingester OC, Ingester PSC, Ingester DK, and Ingester SK steps can be done in any order, or in parallel.
https://github.com/openownership/register-ingester-oc?tab=readme-ov-file#helper-script
Check out the latest code and build via Docker:
cd ~/register-ingester-oc/
git checkout main
git pull
docker compose build
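The same check-out-and-build sequence is repeated for each repository in the steps that follow. As a rough sketch, it could be wrapped in a small helper. The name update_repo_cmds is hypothetical and not part of any Register repository; it only prints the commands so the branch name can be checked first (register-ingester-dk and register-transformer-dk use master rather than main).

```sh
# Hypothetical helper: prints the check-out-and-build steps for one repository.
# Usage: update_repo_cmds REPO [BRANCH]; BRANCH defaults to main.
update_repo_cmds() {
  repo=$1
  branch=${2:-main}
  printf 'cd ~/%s/ && git checkout %s && git pull && docker compose build\n' "$repo" "$branch"
}

update_repo_cmds register-ingester-oc
update_repo_cmds register-ingester-dk master
```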
Ingest the bulk data, where YYYY-MM-DD is the date the Open Corporates FTP files were published:
docker compose run ingester-oc ingest-bulk YYYY-MM-DD
This will prompt you for the FTP password three times.
Note that there is also a streaming ingester service running on Heroku (register-ingester-psc-prd). It might not be necessary to complete the rest of this step if that process is all working correctly without missed data (not currently the case).
Check out the latest code and build via Docker:
cd ~/register-ingester-psc/
git checkout main
git pull
docker compose build
Ingest the bulk data:
docker compose run ingester-psc ingest-bulk
https://github.com/openownership/register-ingester-dk?tab=readme-ov-file#usage
Check out the latest code and build via Docker:
cd ~/register-ingester-dk/
git checkout master
git pull
docker compose build
Ingest the bulk data:
docker compose run ingester-dk ingest-bulk
https://github.com/openownership/register-ingester-sk?tab=readme-ov-file#usage
Check out the latest code and build via Docker:
cd ~/register-ingester-sk/
git checkout main
git pull
docker compose build
Ingest the bulk data:
docker compose run ingester-sk ingest-bulk
The Transformer PSC, Transformer DK, and Transformer SK steps can be done in any order, or in parallel, once their dependencies are satisfied.
The Transformer PSC step depends on the Ingester OC and Ingester PSC steps.
https://github.com/openownership/register-transformer-psc?tab=readme-ov-file#bulk-data
Note that there is also a streaming transformer service running on Heroku (register-transformer-psc-prd). It might not be necessary to complete the rest of this step if that process is all working correctly and no additional bulk data had to be imported.
Check out the latest code and build via Docker:
cd ~/register-transformer-psc/
git checkout main
git pull
docker compose build
Transform the bulk data, where YYYY and MM are the current year and month to be transformed:
docker compose run transformer-psc transform-bulk raw_data/source=PSC/year=YYYY/month=MM/
The Transformer DK step depends on the Ingester OC and Ingester DK steps.
https://github.com/openownership/register-transformer-dk?tab=readme-ov-file#usage
Check out the latest code and build via Docker:
cd ~/register-transformer-dk/
git checkout master
git pull
docker compose build
Transform the bulk data, where YYYY and MM are the current year and month to be transformed:
docker compose run transformer-dk transform-bulk raw_data/source=DK/year=YYYY/month=MM/
The Transformer SK step depends on the Ingester OC and Ingester SK steps.
https://github.com/openownership/register-transformer-sk?tab=readme-ov-file#usage
Check out the latest code and build via Docker:
cd ~/register-transformer-sk/
git checkout main
git pull
docker compose build
Transform the bulk data, where YYYY and MM are the current year and month to be transformed:
docker compose run transformer-sk transform-bulk raw_data/source=SK/year=YYYY/month=MM/
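The three transform-bulk prefixes above follow the same pattern, so the prefix for the current month can be derived from the system date. This is a minimal sketch: bulk_prefix is a hypothetical helper, and it assumes the month in the S3 path is zero-padded to two digits, as MM suggests. The loop only prints the commands; it does not run them.

```sh
# Hypothetical helper: prints the S3 prefix for a source and month.
# Usage: bulk_prefix SOURCE [YYYY] [MM]; year and month default to today's date.
bulk_prefix() {
  source_name=$1
  year=${2:-$(date +%Y)}
  month=${3:-$(date +%m)}
  printf 'raw_data/source=%s/year=%s/month=%s/\n' "$source_name" "$year" "$month"
}

# Print (not run) the transform command for each source, for the current month.
for s in PSC DK SK; do
  name=$(printf '%s' "$s" | tr '[:upper:]' '[:lower:]')
  echo "docker compose run transformer-$name transform-bulk $(bulk_prefix "$s")"
done
```

Passing the year and month explicitly, for example bulk_prefix PSC 2024 03, is useful when transforming a month other than the current one.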
The Download S3 files step and all subsequent Combiner steps depend on the Transformer steps being completed.
https://github.com/openownership/register-sources-bods
openownership/register#265 (comment)
Check out the latest code and build via Docker:
cd ~/register-sources-bods/
git checkout main
git pull
docker compose build
Download the files:
sync-clones
Combine the PSC files:
docker compose run sources-bods combine data/imports/source=PSC/ data/exports/prd/ psc
Combine the DK files:
docker compose run sources-bods combine data/imports/source=DK/ data/exports/prd/ dk
Combine the SK files:
docker compose run sources-bods combine data/imports/source=SK/ data/exports/prd/ sk
Combine all of the files:
docker compose run sources-bods combine-all data/exports/prd/
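The four combine commands above differ only in the source name, so they can be generated in a loop. The sketch below keeps the order in which they are listed, with the per-source commands first and combine-all last. combine_cmds and the EXPORT_DIR variable are hypothetical names introduced for illustration, and the function only prints the commands for review.

```sh
# Hypothetical helper: prints the combine commands in the order listed above.
# EXPORT_DIR is an assumed override; it defaults to the export path used above.
combine_cmds() {
  export_dir=${EXPORT_DIR:-data/exports/prd/}
  for s in PSC DK SK; do
    # The final argument is the lower-case source name.
    name=$(printf '%s' "$s" | tr '[:upper:]' '[:lower:]')
    printf 'docker compose run sources-bods combine data/imports/source=%s/ %s %s\n' \
      "$s" "$export_dir" "$name"
  done
  printf 'docker compose run sources-bods combine-all %s\n' "$export_dir"
}

combine_cmds
```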
Upload the files:
sync-exports-tx
Check that the All compressed file appears on the Register website automatically:
https://register.openownership.org/download
Announce the availability of the bulk data exports internally on Slack in the #oo-technology channel.