stackrox/collector

CrashLoopBackOff in Collector's DaemonSet on OpenShift 4.9

Balaji-MP opened this issue · 25 comments

Hello team, I received the following error while deploying Collector on OpenShift 4.9. I initially thought this was a permissions issue and added the required SCC to Collector's service account, but the issue persists.

terminate called after throwing an instance of 'scap_open_exception'
  what():  can't create map: Permission denied
collector[0x448f7d]
/lib64/libc.so.6(+0x4eb80)[0x7f726981fb80]
/lib64/libc.so.6(gsignal+0x10f)[0x7f726981faff]
/lib64/libc.so.6(abort+0x127)[0x7f72697f2ea5]
/lib64/libstdc++.so.6(+0x9009b)[0x7f726a1c109b]
/lib64/libstdc++.so.6(+0x9653c)[0x7f726a1c753c]
/lib64/libstdc++.so.6(+0x96597)[0x7f726a1c7597]
/lib64/libstdc++.so.6(+0x967f8)[0x7f726a1c77f8]
/usr/local/lib/libsinsp-wrapper.so(+0x240ef5)[0x7f726c82cef5]
/usr/local/lib/libsinsp-wrapper.so(_ZN5sinsp4openERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x36)[0x7f726c866c16]
collector[0x4d2b34]
collector[0x46631c]
collector[0x442bec]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f726980bd85]
collector[0x448e2e]
Caught signal 6 (SIGABRT): Aborted
/bootstrap.sh: line 94:    10 Aborted                 eval exec "$@"

@Balaji-MP It definitely looks like a lack of permissions to load the eBPF probe. Just in case, could you share the DaemonSet definition and the SecurityContext you ended up with?

@erthalion here is the definition and the security context within it

`apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "4"
    email: support@stackrox.com
    meta.helm.sh/release-name: stackrox-secured-cluster-services
    meta.helm.sh/release-namespace: rhacs-operator
    owner: stackrox
  creationTimestamp: "2023-02-16T08:25:19Z"
  generation: 4
  labels:
    app: collector
    app.kubernetes.io/component: collector
    app.kubernetes.io/instance: stackrox-secured-cluster-services
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: stackrox
    app.kubernetes.io/part-of: stackrox-secured-cluster-services
    app.kubernetes.io/version: 3.73.2
    auto-upgrade.stackrox.io/component: sensor
    helm.sh/chart: stackrox-secured-cluster-services-73.2.0
    service: collector
  name: collector
  namespace: rhacs-operator
  ownerReferences:
  - apiVersion: platform.stackrox.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: SecuredCluster
    name: stackrox-secured-cluster-services
    uid: 5b40f3be-1e30-4ded-8480-67fb0a8b03b8
  resourceVersion: "1074444903"
  uid: b4786759-f8f7-4bb8-bdef-ee975923e740
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      service: collector
  template:
    metadata:
      annotations:
        email: support@stackrox.com
        meta.helm.sh/release-name: stackrox-secured-cluster-services
        meta.helm.sh/release-namespace: rhacs-operator
        owner: stackrox
      creationTimestamp: null
      labels:
        app: collector
        app.kubernetes.io/component: collector
        app.kubernetes.io/instance: stackrox-secured-cluster-services
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: stackrox
        app.kubernetes.io/part-of: stackrox-secured-cluster-services
        app.kubernetes.io/version: 3.73.2
        helm.sh/chart: stackrox-secured-cluster-services-73.2.0
        service: collector
      namespace: rhacs-operator
    spec:
      containers:
      - env:
        - name: COLLECTOR_CONFIG
          value: '{"tlsConfig":{"caCertPath":"/var/run/secrets/stackrox.io/certs/ca.pem","clientCertPath":"/var/run/secrets/stackrox.io/certs/cert.pem","clientKeyPath":"/var/run/secrets/stackrox.io/certs/key.pem"}}'
        - name: COLLECTION_METHOD
          value: EBPF
        - name: GRPC_SERVER
          value: sensor.rhacs-operator.svc:443
        - name: SNI_HOSTNAME
          value: sensor.stackrox.svc
        image: registry.redhat.io/advanced-cluster-security/rhacs-collector-rhel8@sha256:c15a9d534e6b0bd73bee22aa8c67503e53266b47f9dd9ef11f9f05f6d007ae02
        imagePullPolicy: Always
        name: collector
        resources:
          limits:
            cpu: 750m
            memory: 1Gi
          requests:
            cpu: 50m
            memory: 320Mi
        securityContext:
          capabilities:
            drop:
            - NET_RAW
          privileged: true
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /host/var/run/docker.sock
          name: var-run-docker-sock
          readOnly: true
        - mountPath: /host/proc
          name: proc-ro
          readOnly: true
        - mountPath: /module
          name: tmpfs-module
        - mountPath: /host/etc
          name: etc-ro
          readOnly: true
        - mountPath: /host/usr/lib
          name: usr-lib-ro
          readOnly: true
        - mountPath: /host/sys
          name: sys-ro
          readOnly: true
        - mountPath: /host/dev
          name: dev-ro
          readOnly: true
        - mountPath: /run/secrets/stackrox.io/certs/
          name: certs
          readOnly: true
      - command:
        - stackrox/compliance
        env:
        - name: ROX_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: ROX_ADVERTISED_ENDPOINT
          value: sensor.rhacs-operator.svc:443
        image: registry.redhat.io/advanced-cluster-security/rhacs-main-rhel8@sha256:727e14f925b7f6bbde4ed291a6b9c4c0e068519364b6fea5ef86126775a0cc9e
        imagePullPolicy: IfNotPresent
        name: compliance
        resources:
          limits:
            cpu: "1"
            memory: 2Gi
          requests:
            cpu: 10m
            memory: 10Mi
        securityContext:
          readOnlyRootFilesystem: true
          runAsUser: 0
          seLinuxOptions:
            type: container_runtime_t
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/ssl/
          name: etc-ssl
        - mountPath: /etc/pki/ca-trust/
          name: etc-pki-volume
        - mountPath: /host
          name: host-root-ro
          readOnly: true
        - mountPath: /run/secrets/stackrox.io/certs/
          name: certs
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 2000
        runAsGroup: 3000
        runAsUser: 1000
      serviceAccount: collector
      serviceAccountName: collector
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - hostPath:
          path: /var/run/docker.sock
          type: ""
        name: var-run-docker-sock
      - hostPath:
          path: /proc
          type: ""
        name: proc-ro
      - emptyDir:
          medium: Memory
        name: tmpfs-module
      - hostPath:
          path: /etc
          type: ""
        name: etc-ro
      - hostPath:
          path: /usr/lib
          type: ""
        name: usr-lib-ro
      - hostPath:
          path: /sys/
          type: ""
        name: sys-ro
      - hostPath:
          path: /dev
          type: ""
        name: dev-ro
      - name: certs
        secret:
          defaultMode: 420
          items:
          - key: collector-cert.pem
            path: cert.pem
          - key: collector-key.pem
            path: key.pem
          - key: ca.pem
            path: ca.pem
          secretName: collector-tls
      - hostPath:
          path: /
          type: ""
        name: host-root-ro
      - emptyDir: {}
        name: etc-ssl
      - emptyDir: {}
        name: etc-pki-volume
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 3
  desiredNumberScheduled: 3
  numberMisscheduled: 0
  numberReady: 0
  numberUnavailable: 3
  observedGeneration: 4
  updatedNumberScheduled: 3`
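
For readers skimming the manifest above, the security-relevant fields can be pulled out per container. The following is a minimal sketch, not part of the original thread: `summarize_security` is a hypothetical helper, and the `pod_spec` dict is a hand-trimmed copy of the posted pod template. It relies on the Kubernetes rule that a container-level `securityContext` field takes precedence over the same field at pod level.

```python
from typing import Any, Dict, List, Tuple


def summarize_security(pod_spec: Dict[str, Any]) -> List[Tuple[str, bool, Any]]:
    """Return (container name, privileged?, effective runAsUser) per container."""
    pod_ctx = pod_spec.get("securityContext") or {}
    rows = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        privileged = ctx.get("privileged", False) is True
        # Container-level runAsUser wins; otherwise fall back to the pod level.
        run_as_user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        rows.append((container["name"], privileged, run_as_user))
    return rows


# Hand-trimmed from the DaemonSet posted in the thread (illustrative only).
pod_spec = {
    "securityContext": {"fsGroup": 2000, "runAsGroup": 3000, "runAsUser": 1000},
    "containers": [
        {
            "name": "collector",
            "securityContext": {
                "capabilities": {"drop": ["NET_RAW"]},
                "privileged": True,
                "readOnlyRootFilesystem": True,
            },
        },
        {
            "name": "compliance",
            "securityContext": {
                "readOnlyRootFilesystem": True,
                "runAsUser": 0,
                "seLinuxOptions": {"type": "container_runtime_t"},
            },
        },
    ],
}

for row in summarize_security(pod_spec):
    print(row)
# ('collector', True, 1000)
# ('compliance', False, 0)
```

This only summarizes what was requested in the manifest; whether the cluster actually admits those settings depends on the SCC discussed later in the thread.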

@Balaji-MP any chance you could run kubectl describe ds collector to get the events as well?

@erthalion here are the events; the current state of the pod is CrashLoopBackOff

`Events:
  Type    Reason            Age  From                  Message
  Normal  SuccessfulCreate  10s  daemonset-controller  Created pod: collector-jst65
  Normal  SuccessfulCreate  3s   daemonset-controller  Created pod: collector-x86fj`

@erthalion I guess the permission issue is caused by the eval on line 94. I might be wrong; any thoughts on this?

bootstrap.sh (including the eval part) is only responsible for starting Collector. The issue you observe is happening when Collector tries to load eBPF probes.

@erthalion any thoughts on this one?

What happens if you remove this part from the security context?

seLinuxOptions:
  type: container_runtime_t

Same error; nothing changed.

@Balaji-MP what about the SCC? You haven't posted it yet. Can you show scc/stackrox-collector?

@erthalion here is the security context in stackrox-collector

securityContext:
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
containers:

> @erthalion here is the security context in stackrox-collector
>
> securityContext:
>   runAsUser: 1000
>   runAsGroup: 3000
>   fsGroup: 2000
> containers:

There is also a SecurityContextConstraints (SCC) object, which should have more information, e.g. whether privileged containers are allowed. Having said that, can you describe your OpenShift setup in more detail? Is there anything special about it?

@erthalion here is the SCC applied for this collector

`runAsUser:
  type: RunAsAny
seLinuxContext:
  type: RunAsAny
seccompProfiles:
- '*'
supplementalGroups:
  type: RunAsAny`

My cluster is standard and no additional restrictions are in place.

@erthalion can you please share the directory location where the collector will create the map?

> @erthalion can you please share the directory location where the collector will create the map?

It's a BPF map, so it's not located on the filesystem. The problem here is that your OpenShift setup somehow prevents Collector from executing the bpf syscall, and we need to find out why.
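
One detail in the error text may help narrow this down. Assuming Collector formats the failing errno with the C library's strerror() (an assumption, not confirmed in this thread), the wording of the message identifies which errno the kernel returned. The sketch below uses only the Python standard library; `errno_names_for` is a hypothetical helper name.

```python
import errno
import os
from typing import List


def errno_names_for(message: str) -> List[str]:
    """Return errno names whose strerror() text matches the tail of `message`."""
    # Keep only the text after the last ": ", e.g. "Permission denied".
    text = message.rsplit(": ", 1)[-1].strip()
    return sorted(
        name
        for code, name in errno.errorcode.items()
        if os.strerror(code) == text
    )


print(errno_names_for("can't create map: Permission denied"))  # ['EACCES'] on Linux
print(errno_names_for("Operation not permitted"))              # ['EPERM'] on Linux
```

The bpf(2) man page lists EPERM for calls made without sufficient privilege, so an EACCES-style message could point at a different kind of denial, for example a security module policy. This is only a hint; the thread does not establish the root cause.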

> here is the SCC applied for this collector
>
> runAsUser:
>   type: RunAsAny
> seLinuxContext:
>   type: RunAsAny
> seccompProfiles:
> - '*'
> supplementalGroups:
>   type: RunAsAny

This doesn't look complete, isn't there anything saying something like below?

allowPrivilegeEscalation: true
allowPrivilegedContainer: true
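
The check being asked for here can be sketched as follows, assuming the SCC has been exported as JSON (for example with oc get scc/stackrox-collector -o json) and parsed into a dict. `fields_not_explicitly_true` is a hypothetical helper, and the `posted` document below contains only the fields that were pasted in this thread.

```python
import json
from typing import Any, Dict, List

# The two SCC fields named in the comment above.
PRIVILEGE_FIELDS = ("allowPrivilegedContainer", "allowPrivilegeEscalation")


def fields_not_explicitly_true(scc: Dict[str, Any]) -> List[str]:
    """List the privilege-related fields that are absent or not set to true."""
    return [field for field in PRIVILEGE_FIELDS if scc.get(field) is not True]


# Only the fields pasted in the thread; a full SCC normally has many more.
posted = json.loads("""
{
  "runAsUser": {"type": "RunAsAny"},
  "seLinuxContext": {"type": "RunAsAny"},
  "seccompProfiles": ["*"],
  "supplementalGroups": {"type": "RunAsAny"}
}
""")

print(fields_not_explicitly_true(posted))
# ['allowPrivilegedContainer', 'allowPrivilegeEscalation']
```

An empty list would mean both fields are explicitly true. A non-empty list only says the fields are not explicitly enabled in the given document; it does not by itself prove that the cluster rejects privileged containers.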

@erthalion no, I don't see anything related to allowPrivilegeEscalation or allowPrivilegedContainer.

> no, I don't see anything related to allowPrivilegeEscalation or allowPrivilegedContainer.

That sounds strange to me. So the output of oc get scc/stackrox-collector -o yaml doesn't show anything else except what you've posted?

Yes, that's correct

@stackrox/collector-team any updates on this issue?

Unfortunately no, nobody has had the capacity to look further into it.

@Balaji-MP TBH OpenShift 4.9 is quite dated and might even be out of support. Would it be feasible for you to upgrade to a more recent version?

@porridge let me upgrade to the latest version and check. In the meantime, do you have a minimum recommended version?

4.12 would be my first choice

@porridge You are correct: upgrading to version 4.12 fixed the issue.

Awesome! Let us know if you need anything else.