Is there a way to use /data/datasets/index.json instead of https://data.opensanctions.org/datasets/latest/index.json?

Question

landonstewart opened this issue 2 years ago · 2 comments

In manifest.yml, am I on the right track in trying to use the local /data/datasets/ generated by opensanctions/opensanctions/ instead of the index available at https://data.opensanctions.org/datasets/latest/index.json?

Something like this: /app/manifests/manifest.yml (??)

schedule: "*/30 * * * *"
catalogs:
  - path: /data/datasets/index.json
    scope: all

When I try this, nothing seems to happen when running yente. After looking at manifest.py, it seems that url: is required here. With the default configuration it works and populates Elasticsearch, but not with the custom one above. With the manifest.yml above, it just starts and sits there without fetching or indexing anything.

TL;DR: I guess what I'm asking is, how does one use the local datasets/ created by a locally running https://github.com/opensanctions/opensanctions instead of fetching all the data from OpenSanctions.org?

I'm running it like this (docker swarm):

  yente:
    image: ghcr.io/opensanctions/yente:latest
    environment:
      YENTE_ENDPOINT_URL: https://<url>
      YENTE_MANIFEST: /app/manifests/manifest.yml
      YENTE_ELASTICSEARCH_URL: http://elasticsearch:9200
      YENTE_STATEMENT_API: "false"
      YENTE_UPDATE_TOKEN: <randomstuff>
    volumes:
      - /mnt/gfs/OpenSanctions/data:/data
      - /mnt/gfs/OpenSanctions/manifest.yml:/app/manifests/manifest.yml
    networks:
      - traefik_public
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
      restart_policy:
        condition: on-failure
      labels:
        - ...

Answer 1 · 2022-12-06T06:53:32.000Z

Since your question also boils down to "How can I avoid paying @pudo for the maintenance of the dataset", I'll leave this one open to the community. Try file URLs, and make sure you understand which version of yente you're running.
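
For readers wondering what the file-URL hint might look like in practice, here is a minimal sketch of the manifest from the question with path: swapped for url: and a file:// scheme. It assumes that the yente version in use accepts file URLs for catalogs, which this thread does not confirm, and that the index is visible inside the container at /data/datasets/index.json through the /data volume mount shown in the question.

```yaml
# Hypothetical manifest.yml: the `url:` key with a file:// scheme is an
# assumption based on the hint above, not a confirmed yente feature.
schedule: "*/30 * * * *"
catalogs:
  - url: file:///data/datasets/index.json
    scope: all
```

The three slashes are intentional: a file URL has an empty host part followed by the absolute path.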

Answer 2 · 2022-12-07T05:21:12.000Z

I appreciate the hints here, even though I haven't got it to work. I hadn't even considered cutting OpenSanctions.org out here, but I guess that would have been the effect.

In the end, I just recommended to my upstream that we use the Bulk Data model and run yente internally for compliance/regulatory reasons.