Allow indexing/including unrelated datasets
Closed this issue · 2 comments
We want to be able to load offshoreleaks and other stuff like this into the default collection at run-time. Probably means detaching our definition of the datasets from the index.json spec a bit at some point.
In order to do this, I want to introduce a manifest.yml to describe all the datasets in the system. This would a) reference the OpenSanctions index and specify how often to fetch it, and b) allow adding more sources that are not part of OpenSanctions.
Here's a proposed format:
opensanctions:
  index: https://data.opensanctions.org/datasets/latest/index.json
  scope: default
  schedule: "*/30 * * * *"
sources:
  icij_offshoreleaks:
    title: ICIJ OffshoreLeaks
    entities_url: https://data.opensanctions.org/contrib/icij-offshoreleaks/full-oldb.json
    schedule: null
    collections:
      - all
      - offshore
  local_dataset1:
    title: My local fraudsters
    schedule: "* 30 1 * * *"
    # Apply an FtM namespace:
    namespace: true
    collections:
      - all
      - fraud
    queries:
      csv_url: file:///home/pudo/data/fraudsters.csv
      entities: (see https://docs.alephdata.org/developers/mappings)
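To make the shape of the `sources` section concrete, here is a minimal Python sketch of how such entries could be modelled and checked once the YAML has been parsed into a dictionary. The `Source` class and `parse_sources` function are hypothetical names used for illustration, not the project's actual implementation, and the rule that every source needs either `entities_url` or `queries` is an assumption read off the two examples.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Source:
    """One entry under `sources:` in the proposed manifest (hypothetical model)."""

    name: str
    title: str
    schedule: Optional[str] = None  # cron string; None means "load once"
    entities_url: Optional[str] = None
    namespace: bool = False
    collections: List[str] = field(default_factory=list)
    queries: Optional[Dict[str, Any]] = None


def parse_sources(manifest: Dict[str, Any]) -> List[Source]:
    """Build Source objects from the `sources` mapping of a parsed manifest."""
    sources = []
    for name, spec in manifest.get("sources", {}).items():
        source = Source(name=name, **spec)
        # Assumed rule: a source must say where its entities come from.
        if source.entities_url is None and source.queries is None:
            raise ValueError(f"{name}: needs entities_url or queries")
        sources.append(source)
    return sources


# The `sources` part of the proposed manifest, as it would look after YAML parsing.
MANIFEST = {
    "sources": {
        "icij_offshoreleaks": {
            "title": "ICIJ OffshoreLeaks",
            "entities_url": "https://data.opensanctions.org/contrib/icij-offshoreleaks/full-oldb.json",
            "schedule": None,
            "collections": ["all", "offshore"],
        },
        "local_dataset1": {
            "title": "My local fraudsters",
            "schedule": "* 30 1 * * *",
            "namespace": True,
            "collections": ["all", "fraud"],
            # The entity mapping is elided in the proposal, so it is left empty here.
            "queries": {"csv_url": "file:///home/pudo/data/fraudsters.csv", "entities": {}},
        },
    }
}

if __name__ == "__main__":
    for source in parse_sources(MANIFEST):
        print(source.name, source.schedule, source.collections)
```

Passing the parsed mapping straight into the dataclass means an unknown key in the manifest fails loudly with a `TypeError`, which may or may not be the desired behaviour for a user-edited file.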
This manifest would have the following effects:
a) Load all OpenSanctions data into the default dataset, checking for updates every 30 minutes.
b) Load the ICIJ OffshoreLeaks database once and include those entities in search results for the collections all and offshore.
c) Generate FtM objects from a local CSV file and load those entities into a new dataset once per night.
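The collection behaviour in b) and c) amounts to inverting the manifest: each source lists its collections, and a search scoped to a collection needs the list of sources in it. A minimal Python sketch of that lookup follows. The function name `collection_members` is hypothetical, and how the OpenSanctions `default` scope relates to these collections is not spelled out in the proposal, so it is left out here.

```python
from collections import defaultdict
from typing import Any, Dict, List


def collection_members(manifest: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each collection name to the sources that declare membership in it."""
    members: Dict[str, List[str]] = defaultdict(list)
    for name, spec in manifest.get("sources", {}).items():
        for collection in spec.get("collections", []):
            members[collection].append(name)
    return dict(members)


# Only the fields relevant to collection membership are reproduced here.
EXAMPLE = {
    "sources": {
        "icij_offshoreleaks": {"collections": ["all", "offshore"]},
        "local_dataset1": {"collections": ["all", "fraud"]},
    }
}

if __name__ == "__main__":
    for collection, names in collection_members(EXAMPLE).items():
        print(collection, names)
```

With the two example sources, `all` contains both datasets, while `offshore` and `fraud` contain one each.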
@pudo Just a naming thing: I would move opensanctions below the sources scope, because it's a source, or alternatively rename sources to additional_sources, because they are additional to the main opensanctions source.
Works well now, closing.