collector logging running amok
Elyytscha opened this issue · 9 comments
Our aggregated logging was exploding due to StackRox collector logging. There were millions of lines like this over the last few days, coming from 3 of 10 collector nodes.
To mitigate as fast as possible, we actually had to delete StackRox.
This is the error message that was present millions of times; it appeared on 3 of 10 collector nodes about 6-10 times every millisecond:
[E 20221209 165715 ConnScraper.cpp:415] Could not determine network namespace: No such file or directory
It was deployed like this:
/bin/bash <(curl -fsSL https://raw.githubusercontent.com/stackrox/stackrox/master/scripts/quick-helm-install.sh)
on GKE v1.24.
Thanks for sharing this @Elyytscha, I'm sorry that this caused you to have to delete StackRox. To help in our investigation, can you share any relevant details about the deployments running on the 3/10 affected nodes? Also, are you using the GKE stable release channel?
Yeah, it was just an evaluation setup, so we didn't run this in production, and it's not that critical that we deleted it. But when this happens during an evaluation, you will agree it is a bad situation.
Yes, we are on the GKE stable release channel.
For relevant details about the deployments, I have to investigate a little bit.
@Elyytscha, thanks for this confirmation of the release channel. We will make sure to fix the logging flood soon.
Are you aware of a common factor among the 3 nodes having the issue? Is something different about them relative to the other 7?
We will try to reproduce the issue, but in the meantime, it would greatly help if you could provide us with the result of ls -lR /proc/xxxx
on the host of a failing node, replacing xxxx with any available PID.
@Elyytscha, based on the early analysis of this issue, it seems that you should be able to work around it by disabling the scraper component of collector. This can be achieved by modifying the daemonset definition and setting "turnOffScrape" to true in the COLLECTOR_CONFIG env var.
https://github.com/stackrox/collector/blob/master/docs/references.md#collector-config
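For example, something like the following should show what COLLECTOR_CONFIG currently contains before you change it, and then let you edit the daemonset in place (namespace and object names below are the StackRox defaults; adjust if yours differ):

# Show the current COLLECTOR_CONFIG value on the collector daemonset
kubectl -n stackrox get ds collector -o yaml | grep -A 1 COLLECTOR_CONFIG

# Edit the daemonset and merge "turnOffScrape": true into the existing JSON value
kubectl -n stackrox edit ds/collector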
Hello,
first I wanted to say thanks for your help! Getting answers this fast is not a given in open-source projects.
Are you aware of a common factor among the 3 nodes having the issue? Is something different about them relative to the other 7?
No, those nodes are from the same resource pool; the only difference to the other nodes could be the pods which get scheduled to them.
Maybe relevant: we use Calico as the SDN.
This is the info I can give about the pods running on one of the affected nodes (but it's possible that the pods which caused this have already been scheduled to another node).
We will try to reproduce the issue, but in the meantime, it would greatly help if you could provide us with the result of
ls -lR /proc/xxxx
on the host of a failing node, replacing xxxx with any available PID.
Be prepared, the logfile I got for this is long. It is so big that I had to split it into two files, but I think you will find what you want to see there, because there are also 'No such file or directory' messages.
getpids.log
getpids.2.log
@Elyytscha, based on the early analysis of this issue, it seems that you should be able to work around it by disabling the scraper component of collector. This can be achieved by modifying the daemonset definition and setting "turnOffScrape" to true in the COLLECTOR_CONFIG env var.
Here I have some questions:
- Which StackRox functionality will be disabled/unavailable if we turn this off?
- Is this possible via the Helm chart? Would https://github.com/stackrox/helm-charts/blob/5cd826a14d7c30d1b7ca538b4ff71d1723339a2c/3.72.2/secured-cluster-services/values-public.yaml.example#L324 be the right place for setting this env var?
We debugged a little further down the rabbit hole. What we found out is that this happens on nodes where we have static Jenkins agents running which build Docker containers in containerd.
The related PIDs, for which the error appears within StackRox, look like this:
# ps aux | grep defunct
root 38481 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 38482 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 38804 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 38866 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 38958 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 39059 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 39193 0.0 0.0 0 0 ? Z 15:36 0:00 [containerd-shim] <defunct>
root 91761 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 91807 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 91952 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 92009 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 92243 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 92332 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
root 92577 0.0 0.0 0 0 ? Z 16:22 0:00 [containerd-shim] <defunct>
and we think it's related to this problem in containerd:
containerd/containerd#5708
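That would explain the error: a defunct process has already released its namespaces, so the /proc/<pid>/ns/net link can no longer be resolved and the lookup fails with "No such file or directory", which is exactly what the scraper logs. A quick, read-only way to confirm this on the host of an affected node (the loop below is just an illustration, run it as root):

# List all defunct PIDs and try to resolve their network namespace link;
# for zombies the readlink fails, matching the ConnScraper error
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^Z/ {print $1}'); do
  printf '%s: ' "$pid"
  readlink "/proc/$pid/ns/net" || echo 'network namespace gone (defunct)'
done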
Good to hear that you may have identified the root cause. Setting turnOffScrape to true will solve the problem with the logging statement, but finding the root cause and solving that is the best option. turnOffScrape has not been extensively tested and is used for internal debugging. When you set it to true you will lose information about endpoints and connections formed before collector was turned on. You can set it via helm charts, but that is not the method that I would recommend. You have to be careful when doing it that way, as COLLECTOR_CONFIG is used to set a few different parameters and you want to set those correctly, not just turnOffScrape. A command you could run to set turnOffScrape is
kubectl set env ds/collector COLLECTOR_CONFIG='{"turnOffScrape":true,"tlsConfig":{"caCertPath":"/var/run/secrets/stackrox.io/certs/ca.pem","clientCertPath":"/var/run/secrets/stackrox.io/certs/cert.pem","clientKeyPath":"/var/run/secrets/stackrox.io/certs/key.pem"}}' --namespace stackrox
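Once that has rolled out, something like this should confirm that the flood has stopped (assuming the container inside the daemonset is named "collector"; adjust if yours differs):

# Wait for the daemonset to pick up the new env var, then count recent occurrences
# of the scraper error in one of the collector pods (should be 0)
kubectl -n stackrox rollout status ds/collector
kubectl -n stackrox logs ds/collector -c collector --since=10m | grep -c 'Could not determine network namespace'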
To use helm charts you have to set "tlsConfig" correctly. See https://github.com/stackrox/helm-charts/blob/5cd826a14d7c30d1b7ca538b4ff71d1723339a2c/3.72.2/secured-cluster-services/templates/collector.yaml#L[…]2
Before your latest comment, I thought the problem might be that your resource limits and requests in namespaces other than stackrox were too low. It might still be worth looking into that.
We fixed it. The issue was due to our old Docker-in-Docker container build system. For new systems we actually use Kaniko, but for old legacy systems there are still Docker builds via dockerd running inside a containerd system. We fixed it basically thanks to this comment:
docker-library/docker#318 (comment)
Docker has added tini as docker-init in their container image, and we used it with:
ENTRYPOINT ["docker-init", "--", "dockerd-entrypoint.sh"]
After this, we no longer had zombie/defunct processes left over after Docker-in-Docker builds had run.
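For anyone else hitting this, a quick sanity check we could suggest from inside the DinD build container after the change (it only reads /proc, so it also works on busybox-based images):

# PID 1 should now be docker-init (tini), which reaps exited children
cat /proc/1/comm

# Count zombie processes left over after a build; with an init reaping them,
# this should stay at zero
grep -l '^State:[[:space:]]*Z' /proc/[0-9]*/status 2>/dev/null | wc -l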
Actually, someone could argue that StackRox showed us an issue in our k8s cluster, but the way it showed us the issue basically produced another issue (flooding our log system with an unnecessary amount of logs).
Still, I think it would be a good idea to limit the log messages that StackRox, or rather collector, produces in such a situation.
The situation appears when there are zombie/defunct processes from old containers which somehow don't get reaped.
Glad to hear that you resolved your problem. Thanks for bringing this to our attention. Based on this experience we plan a few improvements to collector including throttling of logging statements and better handling of defunct processes.