Cruise Control Not Recreated On Deletion
Spazzy757 opened this issue · 7 comments
Description
I updated the Helm chart to the latest version and Koperator to 0.24.1, and ran into some issues with the deployment/upgrade.
During this process Cruise Control was deleted, and I can't seem to get the operator to recreate it.
I see there is a related issue, #740, which says it started working again, but I'm on the latest version and still have this issue.
Expected Behavior
Cruise Control should be recreated.
Actual Behavior
Cruise Control does not exist and I keep getting errors:
{
"level":"error",
"ts":"2023-06-06T07:15:49.909Z",
"msg":"failed to get unavailable brokers at rebalance",
"controller":"CruiseControlTask",
"controllerGroup":"kafka.banzaicloud.io",
"controllerKind":"KafkaCluster",
"KafkaCluster":{
"name":"kafka",
"namespace":"default"
},
"namespace":"default",
"name":"kafka",
"reconcileID":"f3a57a35-bb1b-49d7-893f-2751f6937fa4",
"error":"failed to get list of volumes per broker from Cruise Control: sending HTTP request failed: Get \"http://kafka-cruisecontrol-svc.default.svc.cluster.local:8090/kafkacruisecontrol/kafka_cluster_state?json=true&verbose=true\": dial tcp: lookup kafka-cruisecontrol-svc.default.svc.cluster.local on 10.91.0.10:53: no such host",
"errorVerbose":"sending HTTP request failed: Get \"http://kafka-cruisecontrol-svc.default.svc.cluster.local:8090/kafkacruisecontrol/kafka_cluster_state?json=true&verbose=true\": dial tcp: lookup kafka-cruisecontrol-svc.default.svc.cluster.local on 10.91.0.10:53: no such host\nfailed to get list of volumes per broker from Cruise Control\ngithub.com/banzaicloud/koperator/controllers.checkBrokerLogDirsAvailability\n\t/workspace/controllers/cruisecontroltask_controller.go:239\ngithub.com/banzaicloud/koperator/controllers.(*CruiseControlTaskReconciler).Reconcile\n\t/workspace/controllers/cruisecontroltask_controller.go:178\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594",
"stacktrace":"github.com/banzaicloud/koperator/controllers.(*CruiseControlTaskReconciler).Reconcile\n\t/workspace/controllers/cruisecontroltask_controller.go:180\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234"
}
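The "no such host" part of the error suggests that the Cruise Control Service itself is gone. A minimal sketch for confirming that, assuming the Service name and namespace taken from the hostname in the log (`kafka-cruisecontrol-svc` in `default`); the function names and the `KUBECTL` override variable are hypothetical helpers, not part of Koperator:

```shell
#!/bin/sh
# KUBECTL can be overridden (for example in tests); it defaults to the real CLI.
KUBECTL="${KUBECTL:-kubectl}"

# Returns 0 if the named Service exists in the namespace, non-zero otherwise.
cc_service_exists() {
  service="$1"
  namespace="$2"
  "$KUBECTL" get service "$service" --namespace "$namespace" >/dev/null 2>&1
}

# Prints a one-line status for the given Service and namespace.
report_cc_service() {
  if cc_service_exists "$1" "$2"; then
    echo "service $1 exists in namespace $2"
  else
    echo "service $1 is missing in namespace $2"
  fi
}
```

Calling `report_cc_service kafka-cruisecontrol-svc default` against the cluster should report the Service as missing while the operator has not recreated Cruise Control.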
Affected Version
0.24.1
Steps to Reproduce
- Fresh Install of Koperator
- Create a KafkaCluster
- Delete Cruise Control
Checklist
- I have read the contributing guidelines
- I have verified this does not duplicate an existing issue
I also tried joining your Slack group, but it seems you need a Cisco email address 🥲
It seems the cause is the Cruise Control (CC) topic creation:
{
"level":"info",
"ts":"2023-06-06T11:12:13.412Z",
"logger":"webhooks.KafkaTopic",
"msg":"rejected",
"name":"kafka-cruise-control-topic",
"namespace":"default",
"invalid field(s)":"spec.name: Invalid value: \"__CruiseControlMetrics\": topic \"__CruiseControlMetrics\" already exists on kafka cluster and it is not managed by Koperator,\n\t\t\t\t\tif you want it to be managed by Koperator so you can modify its configurations through a KafkaTopic CR,\n\t\t\t\t\tadd this \"managedBy: koperator\" annotation to this KafkaTopic CR"
}
Okay, for anybody who runs into this: there seems to be a race condition. When Cruise Control is deleted and the __CruiseControlMetrics topic
still exists, the operator will not recreate the Cruise Control deployment. I handled this by deleting the topic with:
kubectl run kafka-topic -it \
--image=ghcr.io/banzaicloud/kafka:2.13-3.1.0 \
--rm=true \
--restart=Never \
-- /opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka-headless:29092 --topic __CruiseControlMetrics --delete
This deleted the topic, and the reconcile logic then created the Cruise Control deployment. I think the issue arises from a race condition here.
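Before deleting anything, it may be worth confirming that the topic really is still present. A rough sketch, assuming the same image and bootstrap address as the delete command above and using the `--list` option of `kafka-topics.sh`; `topic_in_list` and the pod name `kafka-topic-list` are hypothetical names chosen for illustration:

```shell
#!/bin/sh
# Reads a topic list (one name per line) on stdin and succeeds only when
# one line matches the given topic name exactly.
topic_in_list() {
  grep -Fxq -- "$1"
}

# Example use against the cluster (commented out so that sourcing this
# file has no side effects):
#
# kubectl run kafka-topic-list -i \
#   --image=ghcr.io/banzaicloud/kafka:2.13-3.1.0 \
#   --rm=true \
#   --restart=Never \
#   -- /opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka-headless:29092 --list \
#   | topic_in_list __CruiseControlMetrics && echo "topic still exists"
```

The exact-line match avoids false positives from topics whose names merely contain `__CruiseControlMetrics`, and from any extra lines `kubectl` prints when it removes the pod.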
@Spazzy757 thank you for reporting this issue and for the information! Please don't hesitate to open a PR with the fix if you have already found the root cause.
About the Slack issue, can you please share a screenshot showing why it requires a Cisco email to join, so we can get it fixed by our admins?
In the meantime, please feel free to join via this temporary link: https://emergingtechcommunity.slack.com/archives/CK75ATB29
@Spazzy757 Did you try with Google or Apple? I just tried with a fake Google account and was able to join our Slack with the link we have in the README: https://eti.cisco.com/slack
Yes, it works for me. The other link I used came from a comment on an issue, which is probably why it was not working.