ingestors extracts useful information from documents of different types into
a structured, standard format. It retains folder structures across directories,
compressed archives and emails. The extracted data is formatted as Follow the
Money (FtM) entities, ready for import into Aleph or for processing as an
object graph.
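As a rough illustration of how folder structure can survive as an object graph, the sketch below walks a directory and emits dictionaries shaped like FtM entities (`id`, `schema`, `properties`), linking each child to its folder through a `parent` property. It uses only the standard library and is not the ingestors implementation: the `make_id` and `sketch_entities` helpers are hypothetical, and real ingestors would assign more specific schemata and many more properties.

```python
import hashlib
import json
import os
import tempfile


def make_id(path):
    """Derive a stable ID from a path (hypothetical scheme, for illustration only)."""
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def sketch_entities(root):
    """Yield FtM-shaped dicts for a directory tree, linking children to parents."""
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        folder_id = make_id(dirpath)
        folder_props = {"fileName": [os.path.basename(dirpath)]}
        if dirpath != root:
            # Every folder except the root points at its containing folder.
            folder_props["parent"] = [make_id(os.path.dirname(dirpath))]
        yield {"id": folder_id, "schema": "Folder", "properties": folder_props}
        for name in sorted(filenames):
            yield {
                "id": make_id(os.path.join(dirpath, name)),
                "schema": "Document",
                "properties": {"fileName": [name], "parent": [folder_id]},
            }


if __name__ == "__main__":
    # Build a tiny throwaway tree and print one entity per line.
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, "reports"))
        with open(os.path.join(tmp, "reports", "q1.txt"), "w") as fh:
            fh.write("hello")
        for entity in sketch_entities(tmp):
            print(json.dumps(entity))
```

Because each entity carries a reference to its parent, the original hierarchy can be rebuilt from a flat stream of entities.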
Supported file types:
- Plain text
- Images
- Web pages, XML documents
- PDF files
- Emails (Outlook, plain text)
- Archive files (ZIP, RAR, etc.)
Other features:
- Extendable and composable using classes and mixins.
- Writes the generated FollowTheMoney objects to a database as result objects.
- Lightweight worker-style support for logging, failures and callbacks.
- Thoroughly tested.
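To make the "classes and mixins" point concrete, here is a minimal sketch of how such composition could look. All names (`Ingestor`, `EncodingSupport`, `PlainTextIngestor`, `pick_ingestor`) are assumptions chosen for illustration and should not be read as the actual ingestors API: a base class declares which files it handles, a mixin contributes reusable behaviour, and a concrete ingestor combines the two.

```python
class Ingestor:
    """Hypothetical base class: subclasses declare the extensions they handle."""

    EXTENSIONS = ()

    @classmethod
    def match(cls, file_name):
        # An empty EXTENSIONS tuple never matches.
        return file_name.lower().endswith(tuple(cls.EXTENSIONS))

    def ingest(self, file_name, data):
        raise NotImplementedError


class EncodingSupport:
    """Hypothetical mixin: text decoding shared by several ingestors."""

    def decode(self, data):
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            # latin-1 maps every byte to a character, so this cannot fail.
            return data.decode("latin-1")


class PlainTextIngestor(Ingestor, EncodingSupport):
    """Composes the base class with the decoding mixin."""

    EXTENSIONS = (".txt",)

    def ingest(self, file_name, data):
        return {
            "schema": "PlainText",
            "properties": {
                "fileName": [file_name],
                "bodyText": [self.decode(data)],
            },
        }


def pick_ingestor(file_name, candidates):
    """Return an instance of the first candidate class that matches, else None."""
    for cls in candidates:
        if cls.match(file_name):
            return cls()
    return None
```

Under this design, supporting a new file type means adding one subclass and reusing whichever mixins apply, without touching the dispatch logic.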
For local development with a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
To release a new version:
git pull --rebase
make build
make test
source .env/bin/activate
bump2version {patch,minor,major} # pick the appropriate one
git push --atomic origin $(git rev-parse --abbrev-ref HEAD) $(git describe --tags --abbrev=0)
Ingestors are usually called in the context of Aleph. To run them stand-alone, you can use the supplied Docker Compose environment. To enter a working container, run:
make build
make shell
Inside the shell, you will find the ingestors command-line tool. During
development, it is convenient to call its debug mode using files present
in the user's home directory, which is mounted at /host:
ingestors debug /host/Documents/sample.xlsx