
Primary language: Python. License: MIT.

The Beast

The Beast is an experimental, flexible, declarative toolkit that reads machine-readable data from various sources and transforms it into Follow the Money (FtM) entities.

Do not rely on it until it is out of alpha. Everything is very volatile.

More reading

The FTM proposal: alephdata/followthemoney#717

A sample mapping with tons of comments to help you understand the idea better (beware, it's just an example and the format is subject to change): https://github.com/dchaplinsky/thebeast/blob/main/thebeast/tests/sample/mappings/ukrainian_mps.yaml

Validator for the mappings in JSON Schema format (again, a work in progress with tons of comments): https://github.com/dchaplinsky/thebeast/blob/main/thebeast/conf/mapping_validator.json

The first proposal of the mapping (obsolete, but it can give you a better idea): https://gist.github.com/dchaplinsky/8021b530ea7e44c9443afcc3318042fd

Current status

High priority

  • Ingest from databases (MongoDB, PostgreSQL) using SQLAlchemy or Peewee
  • Tests for the database ingest
  • Basic CLI
  • Signals on exceptions and a policy for incorrectly parsed entity values (drop, drop all, drop entity, reraise)
  • Tests for the signals
  • Stats collector (number of signals of each type, number of invalid entities, etc.)
  • Packaging (partially done in packaging_and_spark_integration branch)
  • Documentation (@legless, your notes will be very valuable)
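
The planned policy for incorrectly parsed values is not implemented yet, so the following is only a rough sketch of how the four options listed above might behave. The names `InvalidValuePolicy` and `apply_values` are hypothetical, and the meaning given to each option (in particular "drop all" as "discard every value of that property") is an assumption, not the project's design.

```python
from enum import Enum


class InvalidValuePolicy(Enum):
    """Hypothetical policy names, modelled on the options in the list above."""

    DROP = "drop"                # assumption: discard only the offending value
    DROP_ALL = "drop_all"        # assumption: discard every value of that property
    DROP_ENTITY = "drop_entity"  # assumption: discard the whole entity
    RERAISE = "reraise"          # propagate the original exception


def apply_values(entity, prop, raw_values, parse, policy):
    """Parse raw_values into entity[prop], handling failures per policy.

    Returns a new entity dict, or None if the entity was dropped.
    """
    parsed = []
    for raw in raw_values:
        try:
            parsed.append(parse(raw))
        except ValueError:
            if policy is InvalidValuePolicy.RERAISE:
                raise
            if policy is InvalidValuePolicy.DROP_ENTITY:
                return None
            if policy is InvalidValuePolicy.DROP_ALL:
                parsed = []
                break
            # InvalidValuePolicy.DROP: skip this value and keep the rest
    if parsed:
        entity = {**entity, prop: parsed}
    return entity


person = {"schema": "Person"}
kept = apply_values(person, "birthYear", ["1980", "n/a", "1990"], int,
                    InvalidValuePolicy.DROP)
```

With `DROP`, the unparseable `"n/a"` is skipped and the two valid years are kept; the other policies discard progressively more, up to re-raising the `ValueError`.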

Low priority

  • Advanced ingest routines: regex validation to discard values that do not pass the test?
  • Tests for the resolver wrappers

Done

  • Basic ingest for json/jsonlines/csv, both local and remote, compressed or not, singular or multiple files
  • Tests for the basic ingest
  • Mapping reader
  • Tests for mapping reader
  • Basic digest routines
  • Tests for basic digest routines
  • Advanced ingest routines: constant entities (think Country or Organization)
  • Advanced ingest routines: backreferencing (think talking from subcollections to parent items)
  • Advanced ingest routines: nested collections (think parsing involved JSON)
  • Advanced ingest routines: templates (think combining fields when setting the entity field)
  • Advanced ingest routines: multiple values for the entity property
  • Advanced ingest routines: split string into multiple values
  • Advanced ingest routines: full entity validation and red/green sorting
  • Advanced ingest routines: augmentations/transformations
  • Advanced ingest routines: records transformations
  • Tests for records transformations
  • Tests for the individual resolvers
  • Tests for digest routines
  • Advanced digest routines: multiprocessing
  • Tests for advanced digest routines
  • Basic dump routines (stdout/files)
  • Basic dump routines: statements
  • Tests for basic dump routines
  • Tests for basic dump routines: statements
  • Remove inflate/deflate and pass dicts rather than entities between digest and dump
  • Python 3.11 support (https://github.com/dchaplinsky/thebeast/actions/runs/3802499820/jobs/6468041810, ICRAR/ijson#80)
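
To make two of the completed routines more concrete (templates and splitting a string into multiple values), here is a minimal plain-Python sketch of the idea. It does not use The Beast's API or mapping format; `split_values`, `render_template`, the record fields and the entity layout are all hypothetical and only illustrate the transformation.

```python
import re


def split_values(value, separator=","):
    """Split one string into several trimmed, non-empty values."""
    return [part.strip() for part in value.split(separator) if part.strip()]


def render_template(template, record):
    """Fill {field} placeholders from the record, then collapse extra spaces."""
    rendered = template.format_map(record)
    return re.sub(r"\s+", " ", rendered).strip()


# Hypothetical source record, e.g. one row of a CSV file.
record = {"first_name": "Jane", "last_name": "Doe", "phones": "555-0100; 555-0101;"}

# Hypothetical entity layout: every property holds a list of values.
entity = {
    "schema": "Person",
    "properties": {
        "name": [render_template("{first_name} {last_name}", record)],
        "phone": split_values(record["phones"], separator=";"),
    },
}
```

The template combines two source fields into one property value, while the split turns a single delimited field into multiple values of the same property.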

Running tests

pip install -r requirements.txt
python -m pytest

Run using Docker

The /bin/ directory contains scripts to run Beast inside a Docker container.

Use /bin/run data/mapping.yaml to run Beast with the selected mapping. Note: the mapping and source file(s) must be in the Beast root directory or one of its subdirectories, e.g. ./data/mapping.yaml. You can't point Beast to a file outside its root directory.

Use /bin/tests to run tests.

Use /bin/black to run black to format source files before contributing a pull request.
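
Because the Docker scripts only work with files under the Beast root directory, it can be handy to check a path before running them. The helper below is a small sketch for that purpose; `is_inside_root` is a hypothetical name and is not part of The Beast, and the root path used in the example is made up.

```python
from pathlib import Path


def is_inside_root(candidate, root):
    """Return True if candidate resolves to root itself or a path below it."""
    root = Path(root).resolve()
    # Relative paths are interpreted against the root, as with ./data/mapping.yaml;
    # an absolute candidate replaces the root in the join and is checked as-is.
    resolved = (root / candidate).resolve()
    return resolved == root or root in resolved.parents


beast_root = "/srv/thebeast"  # made-up location of the repository checkout
ok = is_inside_root("./data/mapping.yaml", beast_root)
outside = is_inside_root("../mapping.yaml", beast_root)
```

Resolving both paths first means that `..` segments cannot be used to escape the root unnoticed.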