celery/ceps

Kubernetes operator for Celery

jmdacruz opened this issue · 25 comments

This proposal is about having a Kubernetes operator (see here and here). The scope of the operator would be the following:

  • Defining a CRD for a CeleryApplication. This resource would contain the configuration for the cluster (e.g., container resource requests/limits, number of replicas), Celery configuration (e.g., broker and result backend settings), and the Docker image with the code and launch parameters (e.g., location of the code inside the container, virtualenv). A sketch of such a resource follows this list.
    • As an alternative, the CRD could include a URL for downloading a Python package with the code of the application, instead of the Docker image.
  • The operator itself would manage the control loop for the CeleryApplication CRD, spawning a Kubernetes Deployment for the cluster and another Deployment for running flower. It would also create a Kubernetes Service so that the flower UI/API can be accessed.
  • Broker and result backend provisioning would be out of scope for the operator; these have to be created beforehand. A set of CeleryApplication resources should be able to share a broker and result backend, but each can also pick its own broker configuration.
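To make the proposal more concrete, here is a minimal sketch of what a CeleryApplication resource could look like. The API group and every field name below (celeryVersion, appModule, workers, flower, and so on) are hypothetical illustrations of the ideas above, not an agreed-upon schema:

```yaml
# Hypothetical CeleryApplication resource; the API group and all field names
# are illustrative assumptions, not part of any published CRD.
apiVersion: celery.example.org/v1alpha1
kind: CeleryApplication
metadata:
  name: sample-app
spec:
  celeryVersion: "4.4"               # could drive version-specific deployment shapes
  image: registry.example.org/sample-app:1.0.0
  appModule: myproject.celery_app    # location of the Celery app inside the container
  broker:
    url: amqp://guest:guest@rabbitmq.default.svc:5672//
  resultBackend:
    url: redis://redis.default.svc:6379/0
  workers:
    replicas: 3
    resources:
      requests: {cpu: 250m, memory: 256Mi}
      limits: {cpu: "1", memory: 512Mi}
  flower:
    enabled: true
    service:
      type: ClusterIP
      port: 5555
```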

This idea is inspired by the Flink Kubernetes Operator developed by Lyft: https://github.com/lyft/flinkk8soperator

Yes, this should happen.
It means that we need to create a Controller implementation (See #19) for k8s.

Initial attempt at implementing this (still very much a proof-of-concept): https://github.com/jmdacruz/celery-k8s-operator

How do you imagine the integration with Celery 5 would look?

I honestly haven't been following the development of 5.X that closely; where can I get a glimpse of the biggest changes? Worst case, if there are breaking changes, the operator supports versioning: the CeleryApplication resource includes a celeryVersion attribute (I'm currently ignoring it, but it should be used for this) which can be used to change the shape of the Kubernetes deployment according to the Celery version.

where can I get a glimpse on the biggest changes?

Actually, scratch that... you had already mentioned this above :-). I'll take a look at that.

Trying to stick to the KISS principle, I think a good option would be to keep a single CRD (the CeleryApplication CRD) and a single Docker image for all the different tools and new "roles". Thanks to the celery CLI, the celery-k8s-operator uses a single image to run the workers, flower, and the liveness probes (via celery inspect); something similar could be done to launch the Controller, Router, and Publisher. The operator would create the deployment for the core components (e.g., the Celery Controller), ensuring that the proper configuration is injected (using a Kubernetes ConfigMap) so that the Celery Controller can then do its job. Since the Celery Controller would now take over some of the responsibility the operator has in a 4.X deployment, the operator needs to ensure the controller deployment/pods have the proper Kubernetes permissions to create the required resources.
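As an illustration of the single-image idea, a worker Deployment generated by the operator might look roughly like the sketch below. The image, app module, and ConfigMap names are placeholders, and the liveness probe reuses the celery CLI via celery inspect ping, similar to what the celery-k8s-operator proof-of-concept does:

```yaml
# Rough sketch of a worker Deployment the operator could generate.
# Image, app module, and ConfigMap names are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app-worker
spec:
  replicas: 3
  selector:
    matchLabels: {app: sample-app, role: worker}
  template:
    metadata:
      labels: {app: sample-app, role: worker}
    spec:
      containers:
        - name: worker
          image: registry.example.org/sample-app:1.0.0
          command: ["celery", "-A", "myproject.celery_app", "worker", "--loglevel=INFO"]
          envFrom:
            - configMapRef:
                name: sample-app-celery-config   # broker/result backend settings injected by the operator
          livenessProbe:
            exec:
              # same image, reusing the celery CLI for the health check
              command: ["/bin/sh", "-c", "celery -A myproject.celery_app inspect ping -d celery@$HOSTNAME"]
            initialDelaySeconds: 30
            periodSeconds: 60
```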

Another approach would be to treat the Celery Controller as the operator itself, in which case its deployment should be straightforward (the expectation is that an operator is a standalone deployment), and then use the CeleryApplication CRD to create objects that the Celery Controller operator applies to the Kubernetes cluster.

In general, a Kubernetes operator needs to be very simple to deploy and maintain, since it's by definition a critical piece of infrastructure (it becomes part of the Kubernetes cluster by extending its functionality, making sure other pieces work as expected).

Take a look at what the folks at Lyft do for the Apache Flink operator: https://github.com/lyft/flinkk8soperator

Is this thread still active? Inspired by the need for this at my own work, I also tried building a POC/MVP of a Celery operator to learn the whole thing - https://github.com/brainbreaker/Celery-Kubernetes-Operator

I also presented it at EuroPython 2020 last Friday as part of my talk on automating the management of Kubernetes infra while staying in the Python ecosystem (slides: https://bit.ly/europython20-ppt). I'm willing to commit a certain number of hours every week to build a production-ready version of the Celery operator.

In the Apache Airflow project, we use KEDA to provide autoscaling from 0 to n.
https://www.astronomer.io/blog/the-keda-autoscaler/
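For reference, a KEDA-based setup mostly boils down to installing the KEDA operator and pointing a ScaledObject at the worker Deployment. The sketch below is an illustrative example using a RabbitMQ queue trigger; the deployment name, queue name, and connection secret are assumptions, not anything defined in this thread:

```yaml
# Illustrative KEDA ScaledObject; the deployment name, queue name, and the
# referenced TriggerAuthentication are hypothetical. Requires KEDA to be installed.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-app-worker-scaler
spec:
  scaleTargetRef:
    name: sample-app-worker        # the worker Deployment to scale
  minReplicaCount: 0               # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        queueName: celery          # default Celery queue
        mode: QueueLength
        value: "10"                # target number of messages per replica
      authenticationRef:
        name: rabbitmq-connection  # TriggerAuthentication holding the broker connection
```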

I quite liked the approach KEDA takes when I first saw it.

I'm aware of KEDA and I plan to incorporate it in our solution.

Yes, KEDA is probably the best way to go for the scaling use-case. It keeps us close to native solutions like the HPA and only introduces a metrics server and controller.

For my application, I was personally more focused on the learning experience, so I chose to implement a really basic scaling algorithm without using anything external.

What other tasks should the operator perform other than scaling?

Let me give my input based on what I know about running Celery in production. I've yet to read and understand the proposal and architecture for 5.x you've shared; I'll come back with more input.

I'm focusing on the problem of all the manual work/configuration that needs to be done while setting up Celery on K8s:

  1. Set up the worker deployments (and a separate deployment for periodic tasks using celery beat; see the sketch below).
  2. Set up a flower deployment for observability, and expose a service to make it accessible outside the cluster.
  3. Worker scaling setup (maybe using the KEDA operator, as we discussed).
  4. Make sure things are recoverable: if any of the children (like a worker deployment) goes rogue, the operator recovers it automatically. We might have to discuss the different causes here.
  5. Should the operator also care about alerting when something abnormal happens (failed tasks beyond a threshold, unhealthy workers, etc.)? I'm not sure there.
  6. We could also include a basic broker setup by default for use-cases where people just want to start quickly? But that might go beyond the Celery operator, because brokers have their own notion of a cluster.

There might be more things that come up while managing the lifecycle of a Celery application (I'm not a Celery expert right now, but I'm willing to explore/learn). I guess solving these manual steps would be a good starting point. What do you suggest?
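To illustrate item 1, a common pattern is to run celery beat as its own single-replica Deployment so that the scheduler never runs more than once. The sketch below reuses the same hypothetical image, app module, and ConfigMap names as the earlier examples:

```yaml
# Sketch of a dedicated beat Deployment; a single replica (with a Recreate
# strategy) avoids running two schedulers at once. Names are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app-beat
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels: {app: sample-app, role: beat}
  template:
    metadata:
      labels: {app: sample-app, role: beat}
    spec:
      containers:
        - name: beat
          image: registry.example.org/sample-app:1.0.0
          command: ["celery", "-A", "myproject.celery_app", "beat", "--loglevel=INFO"]
          envFrom:
            - configMapRef:
                name: sample-app-celery-config
```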

You can start with Celery 4.4.x as well.

Yes, but if he does, he'll have to redesign it later on.

I think alerting is a good idea.
We're going to use OpenTelemetry, so I'm not sure how useful flower will be until they migrate as well.

Do read the draft, please 😄. I'd love to hear your comments.

Let's continue the operator conversation from celery issue #4213 here itself (in one place).

Do read the draft, please 😄. I'd love to hear your comments.

I'm still going through the architecture doc and thinking about everything the operator/controller will need to do. In this CEP I definitely see some major changes from the way 4.X needed to be deployed on a K8s cluster.

I'll come back with some questions/comments by this weekend. Sorry for the delay because of my limited availability.

Okay, so I reviewed the architecture for 5, and it looks really promising. I have some comments, which I'll add to PR #27.

For the operator, we could support both 4.X and 5. I feel we should start with 4.4.X as per the suggestion of @auvipy. We can introduce versioning in the operator as we go along.

Correct me if I'm wrong, but since 5 is a breaking release, it will take time to reach a stable version and be adopted by the community; I'm guessing more than a year. Until then, and even beyond that, people would still be using 4.4.X if it's too much effort to migrate and they don't need the new use-cases 5 is going to support.

A 4.4.X operator will be somewhat simpler to implement and a good way to start, because it has fewer moving parts/components.

For the controller implementation for 5, we need to have a detailed discussion around the lifecycle of the components (data sources, sinks, router, execution platform, and so on). We also need to discuss what would lie within the scope of the controller and what wouldn't; for example, managing different message brokers, data sources, and sinks might go beyond the work of the Celery operator.

Ideally, I'd want to try running Celery 5 in production to see the pain points and manual things to be done before writing an operator to fix those. I think there's still some way to go for that. What do you suggest @thedrow?

If you guys agree to go ahead with 4.4.X as a start, I'll chalk out a design document for the operator and share it with you as soon as I can.

Your observation seems practical and logical to me. I would suggest starting with 4.x first. One goose step at a time O:)

I have recently started to create a Celery operator with the Operator Framework:
https://github.com/RyanSiu1995/celery-operator
With the Operator Framework, we may be able to do something more native, like exporting metrics.
I am going to continue the development.
Hopefully, it will have a test version by the end of this month.

@auvipy @thedrow @jmdacruz
Wrote a high-level architecture document for the operator - https://brainbreaker.github.io/Celery-Kubernetes-Operator/architecture

Would like to have inputs/suggestions from you guys.

will look into this next week.

@Brainbreaker I'm going to read this today.
If you want this to be the official way to deploy Celery to k8s, you'll need to submit a CEP.
Unfortunately, the Operator CEP depends on the Controller CEP, which I haven't started yet.
We can work on that together as well to ensure we have gathered all the requirements.

You can use our template to do so.

I'm willing to shepherd this effort.

@Brainbreaker I'm going to read this today.

Awesome, thanks.

If you want this to be the official way to deploy Celery to k8s, you'll need to submit a CEP.

Yes, for sure. I'd be happy to submit a CEP.

Unfortunately, the Operator CEP depends on the Controller CEP, which I haven't started yet.
We can work on that together as well to ensure we have gathered all the requirements.

Sounds good. Although, right now I have written the document with Celery 4.4.X in mind, not 5. But yeah, we should think about making it flexible enough to handle 5 as well.

I'm willing to shepherd this effort.

Great. I'm looking forward to your inputs.

Opened #29. It'll probably be better for you guys to review it as a CEP.

I couldn't review the rendered RST output; however, I've done my best to avoid any random formatting issues using online tools.