ministryofjustice/data-catalogue

Spike: investigate how to deploy custom connectors

MatMoore opened this issue · 6 comments

We've written custom connectors for

https://github.com/ministryofjustice/datahub-custom-api-source
https://github.com/ministryofjustice/datahub-custom-domain-source

So far we've only run them locally via the DataHub CLI, but if we want to keep using them we should work out how to deploy them, so that we can use them from the DataHub UI and schedule ingestions.

It looks like if we want to include custom ingestion libraries, we would need to create our own GMS Docker image: one that extends the linkedin/datahub-gms image and installs our custom packages.

Then we can override datahub-gms.image.repository in our helm values.
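A minimal sketch of that override, assuming we publish the extended image to a registry we control (the repository name and tag below are hypothetical):

    datahub-gms:
      image:
        # Hypothetical custom image built on top of linkedin/datahub-gms
        repository: ghcr.io/ministryofjustice/datahub-gms-custom
        tag: "v0.12.0-moj"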

Via Slack:

datahub-actions is the image created by the Acryl team that bundles the ingestion libraries and potentially some proprietary code, which is why you don't see its Dockerfile in the open source repo. If you run your ingestion through the UI, the datahub-actions image is responsible for the metadata ingestion.
That said, when you have a lot of non-standard ingestion, i.e. custom sources or custom transformers, you can bundle these custom plugins plus the metadata ingestion library into a new image, and deploy & run it your own way.

Alternatively, we could keep the build as it is and run the ingestion outside of DataHub, via the CLI and GitHub Actions.

This seems like it would be the easier option, but at the cost of the ingestions not all being visible in one place.
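For context, "running via the CLI" just means the same invocation we already use locally; a GHA job would wrap something like the following (the recipe path is illustrative, not something committed anywhere yet):

    # Install the DataHub CLI plus one of our custom source packages, then run a recipe
    pip install acryl-datahub
    pip install "git+https://github.com/ministryofjustice/datahub-custom-api-source.git"
    datahub ingest -c recipes/custom-api-source.yaml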

If we created our own build, I'm assuming the images we would need to extend would be the datahub-gms one and possibly the datahub-frontend one(?)

Relevant build commands:

For reference here is the datahub-ingestion Dockerfile https://github.com/datahub-project/datahub/blob/master/docker/datahub-ingestion/Dockerfile

I'm not sure how this relates to the GMS image, but there is this line that brings in the Python packages:

    RUN uv pip install --no-cache -e ".[base,datahub-rest,datahub-kafka,snowflake,bigquery,redshift,mysql,postgres,hive,clickhouse,glue,dbt,looker,lookml,tableau,powerbi,superset,datahub-business-glossary]"

Presumably we can add a similar line to install any custom packages.
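A rough sketch of what that could look like if we extended the ingestion image, assuming our custom sources install cleanly as Python packages and that pip is available in the published image (the base tag and build-context layout are illustrative):

    # Hypothetical Dockerfile extending the upstream ingestion image
    FROM acryldata/datahub-ingestion:head
    # Copy our custom source packages into the image and install them
    COPY datahub-custom-api-source/ /tmp/datahub-custom-api-source/
    COPY datahub-custom-domain-source/ /tmp/datahub-custom-domain-source/
    RUN pip install --no-cache-dir /tmp/datahub-custom-api-source /tmp/datahub-custom-domain-source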

I'm not 100% clear on how the GMS interacts with the Python ingestion code.

It seems the first thing we would need to customise is the ingestion-cron image, deployed via this chart: https://github.com/acryldata/datahub-helm/tree/89c92c8ac73b4dc371d647216d60dff28cc7c9ae/charts/datahub/subcharts/datahub-ingestion-cron

Not sure if that is it, or if there are other things to modify.

Right now I'm leaning towards the GHA option:

  • Seems cleaner to keep the DH deployments unmodified
  • It's fairly easy to independently run and test ingestions within a GHA
  • Easier to go from "new ingestion tested locally via the CLI" to "committed as a new GHA workflow" than it would be to add it to DH
  • Running on a schedule is still an option (subject to minor caveats about exactly when it runs)
  • Easier to accept an ingestion contributed by another team

Accepting that there's a drawback to having ingestions in two places, another downside is that the GHA setup could get messy or confusing once we consider which DataHub instance is the target for each ingestion.

Decision: we will use GitHub Actions to run any custom connectors

  • We already have https://github.com/ministryofjustice/data-catalogue-metadata set up to communicate with DataHub
  • GitHub Actions can run jobs on a schedule, as long as they aren't especially time-sensitive (see the workflow sketch below)
  • CLI ingestions are visible in DataHub, but read-only. We accept the limitation that we won't be able to manage the scheduling of all ingestions in one place.
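As a starting point, a scheduled workflow could look something like this minimal sketch (the workflow name, cron expression, secret names and recipe path are all illustrative, and would live in data-catalogue-metadata or wherever we keep the recipes):

    # .github/workflows/custom-api-ingestion.yml (illustrative)
    name: Custom API source ingestion
    on:
      schedule:
        - cron: "0 6 * * *"    # best-effort daily run; GHA schedules can start late
      workflow_dispatch: {}
    jobs:
      ingest:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.10"
          - run: |
              pip install acryl-datahub
              pip install "git+https://github.com/ministryofjustice/datahub-custom-api-source.git"
          - run: datahub ingest -c recipes/custom-api-source.yaml
            env:
              DATAHUB_GMS_URL: ${{ secrets.DATAHUB_GMS_URL }}
              DATAHUB_GMS_TOKEN: ${{ secrets.DATAHUB_GMS_TOKEN }}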

Follow up task: ministryofjustice/find-moj-data#291