opensanctions/yente

Allow indexing/including unrelated datasets

Closed · 2 comments

pudo commented

We want to be able to load offshoreleaks and other stuff like this into the default collection at run-time. Probably means detaching our definition of the datasets from the index.json spec a bit at some point.

In order to do this, I want to introduce a manifest.yml to describe all the datasets in the system. This would a) reference the OpenSanctions index and say how often to fetch it, and b) allow adding more sources that are not part of OpenSanctions.

Here's a proposed format:

opensanctions:
  index: https://data.opensanctions.org/datasets/latest/index.json
  scope: default
  schedule: "*/30 * * * *"
sources:
  icij_offshoreleaks:
    title: ICIJ OffshoreLeaks
    entities_url: https://data.opensanctions.org/contrib/icij-offshoreleaks/full-oldb.json
    schedule: null
    collections:
      - all
      - offshore
  local_dataset1:
    title: My local fraudsters
    schedule: "30 1 * * *"
    # Apply an FtM namespace:
    namespace: true
    collections:
      - all
      - fraud
    queries:
      csv_url: file:///home/pudo/data/fraudsters.csv
      entities: (see https://docs.alephdata.org/developers/mappings)

This would have the following effects:

a) Load all OpenSanctions data inside the default dataset, checking for updates every 30 minutes.
b) Load the ICIJ OffshoreLeaks database once and include those entities in search results for the collections all and offshore.
c) Generate FtM objects from a local CSV file and load those entities into a new dataset once per night.
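
To make the intended semantics a bit more concrete, here is a rough Python sketch of how a loader might interpret such a manifest. This is not how yente implements it: the function names collection_members and load_mode are made up for illustration, the manifest is hard-coded as the dict a YAML parser would presumably return, the queries section is left out, and the nightly schedule is written as a five-field cron expression on the assumption that "once per night" is what is meant.

```python
from collections import defaultdict

# Hypothetical: the manifest as a plain dict, as it might look after YAML
# parsing. Hard-coded here so the sketch needs no third-party library.
MANIFEST = {
    "opensanctions": {
        "index": "https://data.opensanctions.org/datasets/latest/index.json",
        "scope": "default",
        "schedule": "*/30 * * * *",
    },
    "sources": {
        "icij_offshoreleaks": {
            "title": "ICIJ OffshoreLeaks",
            "entities_url": "https://data.opensanctions.org/contrib/icij-offshoreleaks/full-oldb.json",
            "schedule": None,  # null in YAML: no recurring schedule
            "collections": ["all", "offshore"],
        },
        "local_dataset1": {
            "title": "My local fraudsters",
            "schedule": "30 1 * * *",  # assumption: nightly, five-field cron
            "namespace": True,
            "collections": ["all", "fraud"],
        },
    },
}


def collection_members(manifest):
    """Map each collection name to the extra sources included in it."""
    members = defaultdict(list)
    for name, source in manifest.get("sources", {}).items():
        for collection in source.get("collections", []):
            members[collection].append(name)
    return dict(members)


def load_mode(source):
    """Describe when a source is loaded: a null schedule means once."""
    schedule = source.get("schedule")
    return "once" if schedule is None else f"cron: {schedule}"
```

With this, collection_members(MANIFEST)["all"] lists both extra sources, while "offshore" and "fraud" each contain only one. Treating a null schedule as "load once" is an assumption read off effect b) above.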

@pudo Just a naming thing: I would move opensanctions under the sources key, because it's a source too, or alternatively rename sources to additional_sources, because they are additional to the main OpenSanctions source.

pudo commented

Works well now, closing.