composable-operator/composable

Creation of composed resources very slow

We love Composable and have been using it for quite a while, but lately we've noticed that it takes longer and longer to create resources through Composables. The process seems to slow down the more we deploy to the cluster: the first Composables we deployed came online in a matter of seconds, but now, in a cluster that currently has 28 Composables and 26 namespaces (only 8 of which contain Composables), creating the latest one took over 15 minutes.

I've tried increasing the number of concurrent reconciles, but it doesn't seem to have any effect. I've also recently updated from a previous 0.1.x version to the latest 0.2.1, but I feel like the problem got even worse with the update.

One thing I'm seeing in the controller logs is requests to seemingly unrelated APIs:

manager I0804 14:34:21.897365       1 request.go:665] Waited for 2m0.395606586s due to client-side throttling, not priority and fairness, request: GET:https://10.114.224.1:443/apis/redis.cnrm.cloud.google.com/v1beta1?timeout=32s           
manager I0804 14:34:31.946912       1 request.go:665] Waited for 8.347435543s due to client-side throttling, not priority and fairness, request: GET:https://10.114.224.1:443/apis/bigquery.cnrm.cloud.google.com/v1beta1?timeout=32s          
manager I0804 14:34:41.947209       1 request.go:665] Waited for 18.296080044s due to client-side throttling, not priority and fairness, request: GET:https://10.114.224.1:443/apis/rabbitmq.com/v1beta1?timeout=32s 

We don't have any Composables that create or reference any of these, but I'm wondering if the throttling of these requests might also apply to the requests that get Composables and create/update the underlying resources, which would explain the slowness. However, I can't see anything in the Composable code that would explain these requests. We do have a sizable number of CRDs, but we really only use Composable to create a few secrets from other secrets, so I wouldn't expect the CRD count to have any impact.
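
For context on the throttling itself: client-go builds a client-side token-bucket rate limiter from rest.Config.QPS and rest.Config.Burst (documented defaults of 5 and 10 when left unset), and the discovery GETs against every installed API group count against that limiter like any other request from the same client. A rough back-of-the-envelope sketch of the arithmetic (the host and numbers are illustrative, not taken from Composable's code):

package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// client-go derives a token-bucket limiter from QPS/Burst; leaving them
	// at zero means the documented defaults of 5 and 10 apply.
	cfg := &rest.Config{
		Host:  "https://10.114.224.1:443", // illustrative, matches the logs above
		QPS:   5,
		Burst: 10,
	}
	limiter := flowcontrol.NewTokenBucketRateLimiter(cfg.QPS, cfg.Burst)
	_ = limiter

	// With roughly 200 CRD-backed API groups to discover at 5 requests per
	// second, one full discovery pass alone takes on the order of 40 seconds,
	// and repeated discovery passes compound into the multi-minute waits
	// reported in the throttling messages.
	groups := 200
	fmt.Printf("rough lower bound for one discovery pass: ~%ds\n", groups/int(cfg.QPS))
}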

I'm not the biggest pro in Go, so maybe I've missed something in the code that would explain this. Or maybe this slowness is expected (though that feels unlikely). I'm happy for any pointers, or even just other people that maybe have the same issue :)

xvzf commented

We're not noticing this yet, but likely will in the future. Can you give more details, e.g. how many resources you have?

There was a huge change from 0.1 to 0.2.1 (related to compatibility with newer k8s versions); it might be that we need to tune the client limits.

Sure! We currently have a little over 200 CRDs (the same for all clusters) and between 5 and 30 Composables per cluster. Notably, 159 of the CRDs are Config Connector resources, so this will likely affect many people running on GKE.

We actually worked around this issue in the meantime by increasing the QPS of the REST client to something way above the number of installed CRDs, so that it doesn't get throttled anymore:

diff --git a/controllers/composable_controller.go b/controllers/composable_controller.go
index bf96ad4..4c66fd7 100644
--- a/controllers/composable_controller.go
+++ b/controllers/composable_controller.go
@@ -78,6 +78,8 @@ type ManagerSettableReconciler interface {
 // NewReconciler ...
 func NewReconciler(mgr ctrl.Manager) ManagerSettableReconciler {
        cfg := mgr.GetConfig()
+       cfg.QPS = 300.0
+
        return &ComposableReconciler{
                Client: mgr.GetClient(),
                Scheme: mgr.GetScheme(),

(and then building a custom image and using that)

Not the greatest solution, but it solved the problem of Composable essentially being unusable for us. Maybe this could be a flag for Composable itself, so that building a custom image is not necessary.
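
Something along these lines could make the limits configurable without a custom image. This is just a rough sketch; COMPOSABLE_CLIENT_QPS and COMPOSABLE_CLIENT_BURST are made-up names, not options Composable actually reads today:

package controllers

import (
	"os"
	"strconv"

	"k8s.io/client-go/rest"
)

// configureClientLimits overrides the REST client rate limits from env vars,
// leaving the config untouched when they are unset or unparsable. Hypothetical
// helper: it would be called from NewReconciler before the reconciler is
// built, e.g. configureClientLimits(mgr.GetConfig()).
func configureClientLimits(cfg *rest.Config) {
	if v := os.Getenv("COMPOSABLE_CLIENT_QPS"); v != "" {
		if qps, err := strconv.ParseFloat(v, 32); err == nil {
			cfg.QPS = float32(qps)
		}
	}
	if v := os.Getenv("COMPOSABLE_CLIENT_BURST"); v != "" {
		if burst, err := strconv.Atoi(v); err == nil {
			cfg.Burst = burst
		}
	}
}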

However, that still doesn't really solve the issue of this many requests being sent in the first place. From our research, that looks to be due to the K8s discovery information not being cached (what kubectl usually keeps under ~/.kube/cache/discovery). This is absolutely not my area of expertise, so I'm not sure whether that's by design or even trivial to enable.
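
For what it's worth, client-go does ship the building blocks for caching discovery in-process: a memory-backed cached discovery client combined with a deferred REST mapper only re-lists the API groups when the cache is explicitly invalidated, instead of on every lookup. A rough sketch of that general pattern, not Composable's current code:

package controllers

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/restmapper"
)

// buildCachedRESTMapper wires the discovery client through an in-memory cache,
// so repeated GVK-to-resource lookups reuse the first discovery pass instead
// of listing every API group on the server each time.
func buildCachedRESTMapper(cfg *rest.Config) (*restmapper.DeferredDiscoveryRESTMapper, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return nil, err
	}
	cached := memory.NewMemCacheClient(dc) // results stay cached until Invalidate() is called
	return restmapper.NewDeferredDiscoveryRESTMapper(cached), nil
}

If Composable goes through controller-runtime anyway, its lazily populated REST mapper (apiutil.NewDynamicRESTMapper) might be the more natural integration point, but I haven't tried that.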

xvzf commented

Alright; we are also running on GKE and will likely face the same issue once we scale out.

Feel free to create a PR for your temporary fix and e.g. configure it using an env var.

I'll look into the discovery caching; when I ported the code to kubebuilder v3 I took some notes on this, but haven't addressed them yet.

I created a WIP implementation of the suggestion above (#103) to see if it would help resolve another issue I'm seeing: #102

So far no luck:

failed to create resource: Internal error occurred: failed calling webhook "mcomposable.kb.io": failed to call webhook: Post "https://composable-webhook-service.composable-system.svc:443/mutate-ibmcloud-ibm-com-v1alpha1-composable?timeout=10s": dial tcp 10.8.0.26:9443: i/o timeout
helm.go:84: [debug] Internal error occurred: failed calling webhook "mcomposable.kb.io": failed to call webhook: Post "https://composable-webhook-service.composable-system.svc:443/mutate-ibmcloud-ibm-com-v1alpha1-composable?timeout=10s": dial tcp 10.8.0.26:9443: i/o timeout

Manager logs:

❯ kubectl -n composable-system logs -f deployments/composable-controller-manager
I0228 18:39:35.011002       1 request.go:665] Waited for 1.015355126s due to client-side throttling, not priority and fairness, request: GET:https://10.7.0.1:443/apis/bigquery.cnrm.cloud.google.com/v1beta1?timeout=32s
{"level":"info","ts":1677609577.4164364,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1677609577.4171565,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"ibmcloud.ibm.com/v1alpha1, Kind=Composable","path":"/mutate-ibmcloud-ibm-com-v1alpha1-composable"}
{"level":"info","ts":1677609577.4172797,"logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/mutate-ibmcloud-ibm-com-v1alpha1-composable"}
{"level":"info","ts":1677609577.417378,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"ibmcloud.ibm.com/v1alpha1, Kind=Composable","path":"/validate-ibmcloud-ibm-com-v1alpha1-composable"}
{"level":"info","ts":1677609577.4174392,"logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/validate-ibmcloud-ibm-com-v1alpha1-composable"}
{"level":"info","ts":1677609577.4175375,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1677609577.417829,"logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"info","ts":1677609577.4181302,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1677609577.4182177,"logger":"controller-runtime.webhook","msg":"Serving webhook server","host":"","port":9443}
{"level":"info","ts":1677609577.4183576,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"127.0.0.1:8080"}
{"level":"info","ts":1677609577.4184022,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1677609577.4210615,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
I0228 18:39:37.421191       1 leaderelection.go:248] attempting to acquire leader lease composable-system/c0f58598.ibm.com...
I0228 18:39:54.489885       1 leaderelection.go:258] successfully acquired lease composable-system/c0f58598.ibm.com
{"level":"info","ts":1677609594.4902499,"logger":"controller.composable","msg":"Starting EventSource","reconciler group":"ibmcloud.ibm.com","reconciler kind":"Composable","source":"kind source: *v1alpha1.Composable"}
{"level":"info","ts":1677609594.4903157,"logger":"controller.composable","msg":"Starting Controller","reconciler group":"ibmcloud.ibm.com","reconciler kind":"Composable"}
{"level":"info","ts":1677609594.5915842,"logger":"controller.composable","msg":"Starting workers","reconciler group":"ibmcloud.ibm.com","reconciler kind":"Composable","worker count":1}

xvzf commented

This should be resolved in v0.3.0 and onwards; thanks a lot @lhriley @lauraseidler!