Is there a way to use /data/datasets/index.json instead of https://data.opensanctions.org/datasets/latest/index.json?
landonstewart opened this issue · 2 comments
In manifest.yml, am I on the right track to use the local /data/datasets/ generated by opensanctions/opensanctions/ instead of the index available at https://data.opensanctions.org/datasets/latest/index.json?
Something like this: /app/manifests/manifest.yml (??)
schedule: "*/30 * * * *"
catalogs:
  - path: /data/datasets/index.json
    scope: all
When I try this, nothing seems to happen when running yente. After looking at manifest.py it seems that url: is required here. If I use the default configuration it works and populates Elasticsearch, but not with the custom one above. With the manifest.yml above it just starts and sits there with no fetching/indexing.
TL;DR: I guess what I'm asking is, how does one use the local datasets/ created by a locally running https://github.com/opensanctions/opensanctions instead of fetching all the data from OpenSanctions.org?
I'm running it like this (docker swarm):
yente:
  image: ghcr.io/opensanctions/yente:latest
  environment:
    YENTE_ENDPOINT_URL: https://<url>
    YENTE_MANIFEST: /app/manifests/manifest.yml
    YENTE_ELASTICSEARCH_URL: http://elasticsearch:9200
    YENTE_STATEMENT_API: "false"
    YENTE_UPDATE_TOKEN: <randomstuff>
  volumes:
    - /mnt/gfs/OpenSanctions/data:/data
    - /mnt/gfs/OpenSanctions/manifest.yml:/app/manifests/manifest.yml
  networks:
    - traefik_public
  deploy:
    mode: replicated
    replicas: 1
    placement:
      constraints: [node.role == manager]
    restart_policy:
      condition: on-failure
    labels:
      - ...
Since your question also boils down to "How can I avoid paying @pudo for the maintenance of the dataset", I'll leave this one open to the community. Try file URLs, and make sure you understand which version of yente you're running.
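A minimal sketch of what the "file URLs" hint might look like, assuming the yente version in use accepts a file:// URL in the catalog's url: field (this thread does not confirm that, and it may depend on the version). The helper name local_catalog_url is hypothetical. The path is the one from the question, as seen inside the container, and the manifest keys are the same as in the question with url: in place of path:.

```python
from urllib.parse import quote


def local_catalog_url(index_path: str) -> str:
    """Build a file:// URL from an absolute POSIX path (hypothetical helper)."""
    if not index_path.startswith("/"):
        raise ValueError("expected an absolute path inside the container")
    # quote() leaves "/" untouched and percent-encodes spaces and similar.
    return "file://" + quote(index_path)


# Path from the question: where the local index.json is mounted in the container.
url = local_catalog_url("/data/datasets/index.json")

# Assumed manifest layout: the question's manifest with `url:` instead of `path:`.
# Whether yente reads file:// URLs here is an assumption, not a confirmed feature.
manifest = "\n".join(
    [
        'schedule: "*/30 * * * *"',
        "catalogs:",
        f"  - url: {url}",
        "    scope: all",
    ]
)

print(manifest)
```

The printed text could be saved as the manifest.yml mounted at /app/manifests/manifest.yml. The /data volume from the compose file would still need to be mounted so the URL resolves inside the container.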
I appreciate the hints here even though I haven't got it to work. I hadn't even considered cutting OpenSanctions.org out here but I guess that would have been the effect.
In the end I just recommended, to my upstream, the Bulk Data model and running Yente internally for compliance/regulatory reasons.