replicatedhq/troubleshoot

Performance: support-bundle takes > 3 mins

xavpaice opened this issue · 8 comments

Bug Description

When running tests in KOTS, the support-bundle CLI test takes more than 3 minutes to run, causing timeouts. While increasing the timeout is a workaround, 3 minutes is longer than Troubleshoot should reasonably take.

Expected Behavior

support-bundle returns in < 3 mins

Steps To Reproduce

kURL cluster: https://kurl.sh/8cb4d5d
Command: kubectl support-bundle --load-cluster-specs --redactors=configmap/default/kotsadm-redact-spec/redact-spec,configmap/default/kotsadm-qakotsregression-redact-spec/redact-spec --interactive=false --allow-insecure-connections

Test: https://app.testim.io/#/project/wpYAooUimFDgQxY73r17/branch/master/test/uqPj73W2fTCb[…]-id=za47mjQ58SYjqckP&path=iu3mBMtET269b9Oc%3A7IGXsuKvhZ1myT35

curl -sSL https://kurl.sh/8cb4d5d | sudo bash

Additional Context

TODO: find the Troubleshoot version, the content of the extra redactors, and the Troubleshoot spec stored in the cluster.

Using v0.69.3 (v0.70.0 is current), and the following specs:

apiVersion: troubleshoot.sh/v1beta2
kind: Collector
metadata:
  name: sentry-enterprise
spec:
  collectors:
    - secret:
        name: myapp-postgres
        key: uri
        includeValue: false
    - logs:
        selector:
          - app=sentry
          - role=worker
        limits:
          maxAge: 720h # 30*24
          maxLines: 10000
    - run:
        collectorName: ping-google
        image: flungo/netutils
        command: ["ping"]
        args: ["www.google.com"]
        timeout: 5s
    - http:
        collectorName: echo-ip
        get:
          url: https://api.replicated.com/market/v1/echo/ip
---
apiVersion: troubleshoot.sh/v1beta2
kind: Redactor
metadata:
  name: kotsadm-redact
spec:
  redactors:
  - name: redact cluster version info
    fileSelector:
      file: cluster-info/cluster_version.json
    removals:
      regex:
      - redactor: ("major"?:) (")(?P<mask>.*)(")
      - redactor: ("minor"?:) (")(?P<mask>.*)(")

Result:


============ Collectors summary =============
Succeeded (S), eXcluded (X), Failed (F)
=============================================
run/ping-google (F)             : 36,044ms
cluster-resources (S)           : 1,648ms
http/echo-ip (S)                : 279ms
configmap/coredns (S)           : 6ms
logs/app=sentry,role=worker (S) : 5ms
cluster-info (S)                : 1ms
secret (S)                      : 0ms

============ Redactors summary =============
In-cluster collectors : 6,902ms

============= Analyzers summary =============
Succeeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 44,913ms

real	0m45.313s
user	0m30.832s
sys	0m2.266s

Of that time, the longest gap:

I0718 23:43:25.639488   42228 run_pod.go:450] Pod ping-google has been scheduled for deletion
I0718 23:43:25.639516   42228 run_pod.go:454] Continuously poll each second for Pod ping-google deletion for maximum 60 seconds



I0718 23:43:56.645939   42228 redact.go:54] Redacting cluster-resources/priorityclasses.json
I0718 23:43:56.645952   42228 run.go:155] Collecting support bundle: failed to run collector: run/ping-google: timeout

support-bundle.txt

Attached: logs from a cluster where this behaviour is observed.

I can't see any clear gaps in the logs where one particular thing takes the bulk of the time. Rather, there are a lot of items in the spec, some of which individually take longer than others to complete, and everything runs sequentially to avoid overloading the Kubernetes API with requests.

This seems interesting. Look at these two pod times.

I0719 13:53:10.766915   50166 run.go:155] Collecting support bundle: run/ping-google
I0719 13:53:10.776941   50166 run_pod.go:155] Pod ping-google has been created
I0719 13:53:15.777192   50166 result.go:25] Creating symlink /ping-google.log -> cluster-resources/pods/logs/default/ping-google/collector.log
I0719 13:53:15.777402   50166 result.go:75] Added "/ping-google.log" symlink of "cluster-resources/pods/logs/default/ping-google/collector.log" in bundle output
I0719 13:53:15.789797   50166 result.go:130] Added ".json" to bundle output
I0719 13:53:15.789912   50166 result.go:130] Added "-events.json" to bundle output
I0719 13:53:15.799139   50166 run_pod.go:450] Pod ping-google has been scheduled for deletion
I0719 13:53:15.799175   50166 run_pod.go:454] Continuously poll each second for Pod ping-google deletion for maximum 60 seconds
I0719 13:53:46.808187   50166 run.go:155] Collecting support bundle: failed to run collector: run/ping-google: timeout

I'm guessing that's about 5 seconds to create the pod, judging by the delay from "created" to the log collection.
However, waiting for the scheduled deletion times out and errors after about 30 seconds. That's the vast majority of the runtime for this pod.

The very next pod doesn't see this deletion time, or really almost any deletion time at all.

I0719 13:53:46.808219   50166 run.go:155] Collecting support bundle: run/templated-registry-replicated-com
I0719 13:53:46.836244   50166 run_pod.go:155] Pod templated-registry-replicated-com has been created
I0719 13:53:55.976823   50166 result.go:25] Creating symlink templated-registry-replicated-com/templated-registry-replicated-com.log -> cluster-resources/pods/logs/default/templated-registry-replicated-com/collector.log
I0719 13:53:55.977088   50166 result.go:75] Added "templated-registry-replicated-com/templated-registry-replicated-com.log" symlink of "cluster-resources/pods/logs/default/templated-registry-replicated-com/collector.log" in bundle output
I0719 13:53:55.989749   50166 result.go:130] Added "templated-registry-replicated-com/templated-registry-replicated-com.json" to bundle output
I0719 13:53:55.989855   50166 result.go:130] Added "templated-registry-replicated-com/templated-registry-replicated-com-events.json" to bundle output
I0719 13:53:56.007181   50166 run_pod.go:450] Pod templated-registry-replicated-com has been scheduled for deletion
I0719 13:53:56.007216   50166 run_pod.go:454] Continuously poll each second for Pod templated-registry-replicated-com deletion for maximum 60 seconds
I0719 13:53:56.010445   50166 run.go:155] Collecting support bundle: run/templated-quay-io
I0719 13:53:56.028769   50166 run_pod.go:155] Pod templated-quay-io has been created
I0719 13:54:00.093888   50166 result.go:25] Creating symlink templated-quay-io/templated-quay-io.log -> cluster-resources/pods/logs/default/templated-quay-io/collector.log
I0719 13:54:00.094105   50166 result.go:75] Added "templated-quay-io/templated-quay-io.log" symlink of "cluster-resources/pods/logs/default/templated-quay-io/collector.log" in bundle output
I0719 13:54:00.108496   50166 result.go:130] Added "templated-quay-io/templated-quay-io.json" to bundle output
I0719 13:54:00.108637   50166 result.go:130] Added "templated-quay-io/templated-quay-io-events.json" to bundle output
I0719 13:54:00.139593   50166 run_pod.go:450] Pod templated-quay-io has been scheduled for deletion
I0719 13:54:00.139630   50166 run_pod.go:454] Continuously poll each second for Pod templated-quay-io deletion for maximum 60 seconds
I0719 13:54:00.144790   50166 run.go:155] Collecting support bundle: run/not-templated-quay-io

That looks suspiciously like the 30-second grace period the pod is given to shut down. Considering we can't control how the pod responds to a TERM signal and we've already collected our logs, perhaps we should set the grace period to 0 and just move on?
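To make that concrete, here is a minimal client-go sketch of a zero-grace-period delete (illustrative names; not the actual run_pod.go code):

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodNoGrace asks the API server to remove the pod immediately,
// skipping the default ~30s termination grace period.
func deletePodNoGrace(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	gracePeriod := int64(0)
	return client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}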

Edit:

Here it is again; we're losing minutes to what I suspect is a deletion timeout.

I0719 13:51:04.150622   50166 run.go:155] Collecting support bundle: copy-from-host/kurl-host-preflights
I0719 13:51:06.255574   50166 result.go:130] Added "host-preflights/results-1689773363.txt" to bundle output
I0719 13:51:06.258035   50166 copy_from_host.go:375] Daemonset troubleshoot-copyfromhost-z62vx has been scheduled for deletion
I0719 13:51:06.264389   50166 copy_from_host.go:387] Continuously poll each second for Pod deletion of DaemontSet troubleshoot-copyfromhost-z62vx for maximum 60 seconds
I0719 13:51:38.276305   50166 run.go:155] Collecting support bundle: data

Wow, even the collectd run is basically all waits and timeouts.

I0719 13:51:38.287858   50166 run.go:155] Collecting support bundle: collectd/collectd
I0719 13:52:08.298974   50166 copy_from_host.go:233] Timed out waiting for daemonset troubleshoot-copyfromhost-2sxw8 to be ready
I0719 13:52:08.328778   50166 result.go:130] Added "collectd/rrd/craig-qakotsregression/error.txt" to bundle output
I0719 13:52:08.328841   50166 copy_from_host.go:375] Daemonset troubleshoot-copyfromhost-2sxw8 has been scheduled for deletion
I0719 13:52:08.336949   50166 copy_from_host.go:387] Continuously poll each second for Pod deletion of DaemontSet troubleshoot-copyfromhost-2sxw8 for maximum 60 seconds
I0719 13:53:08.337791   50166 copy_from_host.go:415] Pod troubleshoot-copyfromhost-2sxw8-tdwws forcefully deleted after reaching the maximum wait time of 60 seconds
I0719 13:53:08.353265   50166 copy_from_host.go:423] Daemonset pod troubleshoot-copyfromhost-2sxw8-tdwws in default namespace has been deleted
I0719 13:53:08.353825   50166 run.go:155] Collecting support bundle: copy/spec-qakotsregression

Any chance we can see the error.txt on that thing?

  • 30 seconds to decide it's not coming up
  • 1 minute waiting for it to shutdown and then force deleting it

It spent 90 seconds of its 90.065s run time waiting for it to start and waiting for it to shut down.

If we:

  • Set grace period on deletion to 0
  • Force delete with a grace period of 0 if the pod errors

We would remove 120 seconds of run time. (30 for 2 pods that didn't shut down gracefully, and 60 waiting to delete a pod that never made it to healthy and was force deleted anyway).

I'm just assuming we're not force deleting or setting grace period based on these timeouts. If we are waiting 60 seconds before force deleting, I propose we just start with that if the pod errored in the first place. 60 seconds is a pretty long timeout too.
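Putting both bullets together, a hedged sketch of what that could look like (illustrative helper and parameter names, not the real run_pod.go logic): force delete immediately if the pod already errored, otherwise poll for deletion up to a cap and then force delete anyway.

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteCollectorPod is a hypothetical helper: podFailed and maxWait would
// come from the collector's own state and a (shorter) configurable timeout.
func deleteCollectorPod(ctx context.Context, client kubernetes.Interface, ns, name string, podFailed bool, maxWait time.Duration) error {
	zero := int64(0)
	force := metav1.DeleteOptions{GracePeriodSeconds: &zero}

	// A pod that never became healthy gets force deleted straight away.
	if podFailed {
		return client.CoreV1().Pods(ns).Delete(ctx, name, force)
	}

	// Otherwise try a normal delete first...
	if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return err
	}

	// ...and poll each second (as the current logs describe), but only up to maxWait.
	deadline := time.Now().Add(maxWait)
	for time.Now().Before(deadline) {
		if _, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{}); apierrors.IsNotFound(err) {
			return nil // pod is gone
		}
		time.Sleep(time.Second)
	}

	// Still there after maxWait: force delete with no grace period and move on.
	return client.CoreV1().Pods(ns).Delete(ctx, name, force)
}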

banjoh commented

If we:

  • Set grace period on deletion to 0
  • Force delete with a grace period of 0 if the pod errors

We would remove 120 seconds of run time. (30 for 2 pods that didn't shut down gracefully, and 60 waiting to delete a pod that never made it to healthy and was force deleted anyway).

I'm just assuming we're not force deleting or setting grace period based on these timeouts. If we are waiting 60 seconds before force deleting, I propose we just start with that if the pod errored in the first place. 60 seconds is a pretty long timeout too.

Work was done in PR #1196 to use zero-second grace periods to force delete unhealthy pods. For healthy pods launched by the run_pod collector, termination still needs to take timeouts into consideration, which has not been implemented yet. I describe the problem and possible solutions here.

As for the ping-google collector, adding a timeout to the ping command itself should do the trick.

apiVersion: troubleshoot.sh/v1beta2
kind: Collector
metadata:
  name: sentry-enterprise
spec:
  collectors:
    - run:
        collectorName: ping-google
        image: flungo/netutils
        command: ["ping"]
        args: ["google.com", "-c", "3", "-w", "5"]
        timeout: 5s

============ Collectors summary =============
Succeeded (S), eXcluded (X), Failed (F)
=============================================
run/ping-google (S)   : 5,578ms
cluster-resources (S) : 2,396ms
cluster-info (S)      : 7ms

============ Redactors summary =============
In-cluster collectors : 518ms

============= Analyzers summary =============
Succeeded (S), eXcluded (X), Failed (F)
=============================================
No analyzers executed

Duration: 8,528ms
banjoh commented

I created issue #1282, which will address the collectd collector, a special case of this specific problem.

PR #1281 is ready for review.