Spike: investigate how to deploy custom connectors
MatMoore opened this issue · 6 comments
We've written custom connectors for:
- https://github.com/ministryofjustice/datahub-custom-api-source
- https://github.com/ministryofjustice/datahub-custom-domain-source
Currently we've just run them locally via the DataHub CLI, but if we want to keep using them we should work out how to deploy them, so that we can use them in the DataHub UI and schedule ingestions.
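For reference, running one of these locally boils down to pointing a recipe at the custom source class and invoking the CLI. A minimal sketch (the module path and server URL below are assumptions, not the actual values from our repos):

```yaml
# recipe.yaml (sketch): a custom source is referenced by its fully
# qualified Python path instead of a built-in plugin name
source:
  type: datahub_custom_api_source.source.CustomApiSource  # hypothetical path
  config: {}

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080  # local GMS
```

With the connector package pip-installed alongside acryl-datahub, this runs as `datahub ingest -c recipe.yaml`.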
It looks like if we want to include custom ingestion libraries, we would need to create our own GMS Docker image, one that extends the linkedin/datahub-gms image and installs our custom packages. Then we can override `datahub-gms.image.repository` in our helm values.
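As a sketch, assuming we publish the extended image somewhere we control (the repository name below is a placeholder):

```yaml
# values.yaml (sketch): point the chart at our extended GMS image
datahub-gms:
  image:
    repository: ghcr.io/ministryofjustice/datahub-custom-gms  # hypothetical image
    tag: v0.13.0  # pin to the upstream DataHub version the image extends
```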
Via Slack:

> datahub-actions is the image created by the Acryl team that bundles the ingestion libraries and potentially some proprietary code; that's why you don't see its Dockerfile in the open source repo. If you run your ingestion through the UI, the datahub-actions image is responsible for the metadata ingestion.
>
> That said, when you have a lot of non-standard ingestion (i.e. custom sources, custom transformers), you can bundle these custom plugins plus the metadata ingestion library into a new image, and deploy and run it in your own way.
Alternatively, we could keep the build as it is and run the ingestion outside of DataHub via the CLI and GitHub Actions.
This seems like the easier option, but at the cost of the ingestions not all being visible in one place.
If we created our own build, I'm assuming the images we would need to extend would be the datahub-gms one and possibly the datahub-frontend one(?)
Relevant build commands:
- `./gradlew :metadata-service:war:build` (config here: https://github.com/datahub-project/datahub/blob/771ab0d4a866ca5da4e8499e08bf2f6589c90879/metadata-service/war/build.gradle)
- `./gradlew :datahub-frontend:build` (config here: https://github.com/datahub-project/datahub/blob/771ab0d4a866ca5da4e8499e08bf2f6589c90879/datahub-frontend/build.gradle#L55)
For reference here is the datahub-ingestion Dockerfile https://github.com/datahub-project/datahub/blob/master/docker/datahub-ingestion/Dockerfile
I'm not sure how this relates to the GMS image, but there is this line that brings in the Python packages:

```dockerfile
RUN uv pip install --no-cache -e ".[base,datahub-rest,datahub-kafka,snowflake,bigquery,redshift,mysql,postgres,hive,clickhouse,glue,dbt,looker,lookml,tableau,powerbi,superset,datahub-business-glossary]"
```
Presumably we can add a similar line to install any custom packages.
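For example, a derived image might look like the sketch below (assuming our connector repos are pip-installable from git, and that uv is available in the published image as it is in the upstream build):

```dockerfile
# Dockerfile (sketch): extend the upstream ingestion image with our plugins
FROM acryldata/datahub-ingestion:v0.13.0
# ^ pin to match the deployed DataHub version

# Install our custom sources on top of the bundled ingestion plugins
RUN uv pip install --no-cache \
    git+https://github.com/ministryofjustice/datahub-custom-api-source \
    git+https://github.com/ministryofjustice/datahub-custom-domain-source
```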
Note: DataHub Docker images are built using Python 3.10, whereas we are targeting 3.11: https://github.com/datahub-project/datahub/blob/08731055ba1df94a1f7e52b23c5d6e257b1f0c79/docker/datahub-ingestion-base/Dockerfile#L27
I'm not 100% clear on how the GMS interacts with the Python ingestion code.
It seems that the first thing we would need to customise is the ingestion-cron image, deployed in this chart: https://github.com/acryldata/datahub-helm/tree/89c92c8ac73b4dc371d647216d60dff28cc7c9ae/charts/datahub/subcharts/datahub-ingestion-cron
I'm not sure if that's everything, or if there are other things to modify.
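If the cron image is the right lever, the override would presumably look something like this in our top-level values (a sketch; the image name is a placeholder):

```yaml
# values.yaml (sketch): enable the cron subchart and swap in our own image
datahub-ingestion-cron:
  enabled: true
  image:
    repository: ghcr.io/ministryofjustice/datahub-custom-ingestion  # hypothetical
    tag: latest
```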
Right now I'm leaning towards the GHA option:
- Seems cleaner to keep the DH deployments unmodified
- It's fairly easy to independently run and test ingestions within a GHA
- Easier to go from "new ingestion tested locally on CLI" to "committed as a new GHA" (easier than adding to DH)
- Running on a schedule is still an option (subject to minor caveats about when it exactly runs)
- Easier to accept an ingestion contributed by another team
Beyond the drawback of having ingestions in two places, another downside is that the GHA setup could get messy or confusing once we consider which DataHub instance is the target for each ingestion.
Decision: we will use GitHub Actions to run any custom connectors
- We already have https://github.com/ministryofjustice/data-catalogue-metadata set up to communicate with DataHub
- GitHub Actions can run jobs on a schedule, as long as they aren't super time sensitive (see the workflow sketch below)
- CLI ingestions are visible in DataHub, but read-only. We accept the limitation that we won't be able to manage the scheduling of all ingestions in one place.
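For illustration, a minimal scheduled workflow could look like this (the recipe path, secret names, and installation from git are all assumptions):

```yaml
# .github/workflows/ingest.yml (sketch)
name: Run custom ingestion
on:
  schedule:
    - cron: "0 6 * * *"  # daily; GHA schedules are best-effort, not exact
  workflow_dispatch: {}  # allow manual runs for testing

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install the DataHub CLI and our custom sources
        run: |
          pip install acryl-datahub
          pip install git+https://github.com/ministryofjustice/datahub-custom-api-source
          pip install git+https://github.com/ministryofjustice/datahub-custom-domain-source
      - name: Run ingestion
        env:
          DATAHUB_GMS_URL: ${{ secrets.DATAHUB_GMS_URL }}      # assumed secret names
          DATAHUB_GMS_TOKEN: ${{ secrets.DATAHUB_GMS_TOKEN }}
        run: datahub ingest -c recipes/custom-api.yaml  # assumed recipe path
```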
Follow up task: ministryofjustice/find-moj-data#291