banzaicloud/koperator

KOperator is stuck in a rebalance disks loop

ilievladiulian opened this issue · 4 comments

Describe the bug
Koperator is stuck in a rebalance disks loop which fails with the following message:

{
    "level":"error",
    "ts":"2022-08-25T08:20:33.243Z",
    "msg":"re-balancing disk(s) in Kafka cluster via Cruise Control failed",
    "controller":"CruiseControl",
    "controllerGroup":"kafka.banzaicloud.io",
    "controllerKind":"KafkaCluster",
    "kafkaCluster": {
        "name":"kafka-test",
        "namespace":"ns-kafka-test"
    },
    "namespace":"ns-kafka-test",
    "name":"kafka-test",
    "reconcileID":"2ef740c0-19ba-4091-bfc9-45f0a161a830",
    "operation":"rebalance disks",
    "brokers":["1","2","3"],
    "error":"json: cannot unmarshal number 1.0 into Go struct field BrokerLoadStats.brokers.NumCore of type int32",
    "stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"
}

Steps to reproduce the issue:

  1. Use a Cruise Control version higher than 2.5.94
  2. Create a KafkaCluster CR with one log dir per broker and apply it.
  3. Modify the KafkaCluster CR to have 2 log dirs per broker and apply it.
  4. Modify the KafkaCluster CR to have 1 log dir per broker again and apply it.
  5. Koperator gets stuck in the rebalance disks operation with the error above.

Expected behavior
Koperator applies the changes without errors.

Additional context
The error above is caused by the rebalance disks call that Koperator makes to Cruise Control. After release 2.5.94, Cruise Control reports the number of cores in host load and broker load responses as a double instead of an int (PR-1839). The go-cruise-control client still declares this field as int32 (see here), so decoding the response fails.
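A minimal sketch of the failure mode, using only the Go standard library. The struct name `brokerStatsInt` and the sample payloads are hypothetical and only mirror the `NumCore` field named in the error message; this is not the actual go-cruise-control type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// brokerStatsInt is a hypothetical stand-in for a client struct that
// models the core count as an integer, as the error message suggests.
type brokerStatsInt struct {
	NumCore int32 `json:"NumCore"`
}

// decodeInt tries to decode a JSON payload into the integer-typed struct.
func decodeInt(payload string) (brokerStatsInt, error) {
	var s brokerStatsInt
	err := json.Unmarshal([]byte(payload), &s)
	return s, err
}

func main() {
	// Integer literal, as an older Cruise Control would presumably send it.
	_, err := decodeInt(`{"NumCore": 1}`)
	fmt.Println("integer literal:", err)

	// Literal with a fractional part: encoding/json refuses to store it
	// in an int32 field, even though the value is numerically whole.
	_, err = decodeInt(`{"NumCore": 1.0}`)
	fmt.Println("float literal:", err)
}
```

The second call returns an unmarshal error of the same kind as the one in the log above, because `encoding/json` rejects a number literal with a fractional part when the target is an integer type.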

Edit: added the full error message.

Hi @ilievladiulian!
Thanks for the bug report!
If you are sure this is what's causing the issue, it seems like a pretty easy fix. Would you be interested in creating a PR addressing this?

Hi, @Kuvesz!
As far as I can tell, the change in Cruise Control breaks compatibility with previous versions. Should the change in Koperator ensure backwards compatibility, or break it as well?

Well, right now we don't support the Cruise Control version you mentioned (check supported versions here), but sooner or later we will have to. In that case it would be best to keep backwards compatibility by checking the Cruise Control version and branching on it. If that's not possible, we'll just have to follow suit and break compatibility.
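On the backwards compatibility question, one observation that may help: Go's `encoding/json` decodes both `4` and `4.0` into a `float64` field. The sketch below assumes that the only difference between Cruise Control versions is the number format of `NumCore`; the struct name `brokerStatsFloat` is hypothetical and this is not necessarily how the client was eventually fixed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// brokerStatsFloat is a hypothetical struct that models the core count
// as a float64 so that both number formats can be decoded.
type brokerStatsFloat struct {
	NumCore float64 `json:"NumCore"`
}

// decodeFloat decodes a JSON payload into the float-typed struct.
func decodeFloat(payload string) (brokerStatsFloat, error) {
	var s brokerStatsFloat
	err := json.Unmarshal([]byte(payload), &s)
	return s, err
}

func main() {
	// First payload uses an integer literal, second a fractional literal.
	for _, p := range []string{`{"NumCore": 4}`, `{"NumCore": 4.0}`} {
		s, err := decodeFloat(p)
		fmt.Println(s.NumCore, err)
	}
}
```

Under that assumption, a `float64` field reads responses in either format without branching on the Cruise Control version, although any code that consumes the field as an integer would still need to convert it.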

I have made a PR for this: banzaicloud/go-cruise-control#11
I have tested with CC 2.5.101 and it works.
The next Koperator version will contain this fix and many more.