Enhance Kedro Deployment
Overview
This parent issue tracks our ongoing efforts to improve Kedro deployment. Based on user research, we aim to address key challenges by enhancing plugins, refining documentation, and developing new features to better support our community's deployment needs.
Research Initiatives
We began our research in October 2024 with a user survey and follow-up interviews:
- Survey: We received 43 responses (https://www.surveys.online/jfe/form/SV_8pGfndbdtbrfbaS).
- User Interviews: We conducted 11 in-depth interviews, recorded and analysed using Dovetail.
- Synthesis and Findings: All insights and synthesis are consolidated on the Kedro Miro Board. Research Playback: #4325.
Key Insights and Challenges
- Plugin Compatibility: Users relying on Kedro's connection plugins for third-party platforms face outdated plugins or compatibility issues, making the conversion of Kedro nodes into platform components challenging and leading them to seek alternative solutions. #4318
- Node Grouping Functionality: Users value merging multiple nodes into a single task on the deployment platform for clarity and efficiency, but current plugins provide only limited support for this. #4319
- Kedro-Databricks Integration: Users deploy Kedro projects on Databricks in two ways: a longer method that packages the project as a .whl file and uploads it to DBFS, and a quicker method that makes the project code directly accessible in a Databricks repo, with the option of running it in notebooks.
- Support for Online Inference: Users are increasingly seeking to deploy online inference pipelines (such as LLM calls) in isolated environments for real-time predictions; however, Kedro offers limited support for this functionality.
- Container Deployment Efficiency: Users often deploy with Docker images, but for larger projects, using a single container for the entire project can be inefficient.
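To make the node-grouping point concrete, here is a minimal sketch of what collapsing nodes into deployment tasks involves. Everything below is illustrative, not Kedro's actual API: it assumes each node carries a group label (think a Kedro namespace or tag) and derives group-level dependencies from dataset lineage, which is the part current plugins handle only partially.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a Kedro node: named inputs/outputs plus a group label."""
    name: str
    inputs: frozenset
    outputs: frozenset
    group: str  # e.g. a namespace or tag used to decide task boundaries

def group_nodes(nodes):
    """Collapse nodes into one task per group; a group depends on another
    group whenever it consumes a dataset that group produces."""
    producers = {}  # dataset name -> group that produces it
    for n in nodes:
        for out in n.outputs:
            producers[out] = n.group
    deps = {n.group: set() for n in nodes}
    for n in nodes:
        for inp in n.inputs:
            src = producers.get(inp)
            if src is not None and src != n.group:
                deps[n.group].add(src)
    return deps

nodes = [
    Node("clean", frozenset({"raw"}), frozenset({"clean_data"}), "data_engineering"),
    Node("features", frozenset({"clean_data"}), frozenset({"features"}), "data_engineering"),
    Node("train", frozenset({"features"}), frozenset({"model"}), "data_science"),
]
print(group_nodes(nodes))
# {'data_engineering': set(), 'data_science': {'data_engineering'}}
```

The resulting group-level DAG (two tasks, one edge) is what you would hand to an orchestrator instead of three per-node tasks.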
Next Steps
We will continue to address these insights through targeted improvements and new feature development. This issue will track the progress of all related tasks and discussions, with updates and deliverables shared as they are completed.
Feel free to contribute, discuss, or raise additional concerns related to Kedro deployment in the comments below.
Back in 2022, a small team and I pitched (and prototyped) Exedra (probably misspelled Exidra) to provide an intermediate representation for deploying to orchestrators, to help solve the issue of maintaining many deployment plugins. It's probably somewhat along the lines of some of the discussions around separating node grouping from deployment. In that case, Kedro-Exedra would manage node grouping, while Exedra handles deployment. It looked at and leveraged the structural similarities between a lot of the (then-modern) ML workflow tools, with demos for Kubeflow Pipelines/Vertex AI, Azure ML, and SageMaker. IIRC I also made pretty clear claims that you should just deploy at the modular pipeline level, and that nobody needs to deploy per-node.
There have also been other attempts to abstract the second part, and to create a more unified deployment language, such as Couler. This could also have helped ease the maintenance burden, but it never took off either. (Also, while the project creator was very open to collaborating and having people add other orchestration backends/generally improving the project, there wasn't buy-in to invest in this from QB.) Old related issue: #2058
This was based on what IMO was the best way to deploy almost 3 years ago now. If it is still important to be able to deploy to all of these, it's probably still a good starting point. It's also worth noting that a lot of these focus on the data science side, not the data engineering side.
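To sketch what an Exedra-style intermediate representation could look like: a tiny orchestrator-agnostic model of tasks and dependencies, plus one emitter per backend. Everything here (`Task`, `Workflow`, the Airflow emitter, the task names) is hypothetical and much simplified, not Exedra's actual design; the only real piece is that `kedro run --pipeline=<name>` is how you'd run one modular pipeline per task.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """One deployable unit, e.g. a Kedro modular pipeline run as a shell command."""
    name: str
    command: str
    depends_on: list = field(default_factory=list)

@dataclass
class Workflow:
    """Orchestrator-agnostic intermediate representation: tasks + dependency edges."""
    name: str
    tasks: list

def to_airflow_stub(wf: Workflow) -> str:
    """Emit a (simplified, illustrative) Airflow DAG from the IR.
    Other backends (Argo, Azure ML, ...) would be separate emitters over the same IR."""
    lines = [
        "from airflow import DAG",
        "from airflow.operators.bash import BashOperator",
        f'with DAG("{wf.name}") as dag:',
    ]
    for t in wf.tasks:
        lines.append(
            f'    {t.name} = BashOperator(task_id="{t.name}", bash_command="{t.command}")'
        )
    for t in wf.tasks:
        for dep in t.depends_on:
            lines.append(f"    {dep} >> {t.name}")
    return "\n".join(lines)

wf = Workflow("demo", [
    Task("de", "kedro run --pipeline=data_engineering"),
    Task("ds", "kedro run --pipeline=data_science", depends_on=["de"]),
])
print(to_airflow_stub(wf))
```

The point of the split is that a Kedro-side plugin only has to produce the `Workflow`, and the per-orchestrator emitters can be maintained (and shared) independently of Kedro.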
On top of that, there's a question of what is the best way to deploy a Kedro pipeline (e.g. as part of a data platform or broader ecosystem). Let's say you're an open-source user, and you want to be able to productionize your pipeline, and you have no strong requirements around what tool to use to do that. I think this is more of an open question; also happy to hear if others have seen a best option in some of these cases. For example, @gtauzin recently went through this journey, but I'm sure many others have, too.
It's probably also worth looking at a lot of the existing work in this space. For example, especially in the data engineering space, what is the best way to deploy something like dbt? What do Airflow, Dagster, etc. deployments look like for these? Do any stand out? What are the killer features?
Pieces of discussion with @astrojuanlu:
I remember discussing couler some time last year. I checked again and indeed the project seems kind of dead...
Yeah, it's not really being worked on. Realistically, to my chagrin, most people don't need something like Couler. I actually brought up my interest in this space in my past work creating unifying abstractions, but was told that orchestrators are all kinda the same, and people just pick one and it works; it's probably true to some extent...
(Even for Couler, it's basically just built for Argo, because that's what the people using it use—so much for a unifying abstraction. :rolling_on_the_floor_laughing:)
the other user journey is: "I work in a team that's part of a company which already chose an MLOps platform/Cloud provider. how do I get Kedro there?". so we need to acknowledge that some users don't really have a choice, and a lack of options prevents them from adopting Kedro.
Yes, 100% agree that this is a challenge, and it's also like the one case where you want that Exedra- or Couler-esque abstraction.
and about the comparison with dbt, I 100% agree. despite the differences, we can probably learn a lot from them.
It's also worth noting that if it doesn't make sense to deploy dbt to one of these tools, it quite possibly also doesn't make sense to deploy Kedro data engineering pipelines there. If Kedro is going to support these workflows, then you need to be able to tell the deployment story across DE and DS—will it require multiple deployment tools and plugins, or do you focus on those that support the full story, like Airflow, Dagster, maybe Vertex AI? (From a quick search, I'm not seeing great resources on deploying dbt in SageMaker or Azure ML Pipelines, but correct me if I'm missing something.)