digitalocean/csi-digitalocean

csi-do-controller-0 CrashLoopBackOff: couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json"

max3903 opened this issue · 23 comments

What did you do? (required. The issue will be closed when not provided.)

I followed the documentation to add the do-block-storage plugin:

I added the secret successfully and ran:

kubectl apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

It fails on some snapshot-specific resources:

CustomResourceDefinition.apiextensions.k8s.io "volumesnapshots.snapshot.storage.k8s.io" is invalid: spec.version: Invalid value: "v1alpha1": must match the first version in spec.versions
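For context, this validation error means the CRD's deprecated singular spec.version field must match the first entry of spec.versions. A minimal illustration of the constraint (hypothetical CRD fragment, not the actual manifest):

```yaml
# apiextensions.k8s.io validation: when both fields are present,
# spec.version must equal spec.versions[0].name.
spec:
  version: v1alpha1        # must match versions[0].name below
  versions:
    - name: v1alpha1
      served: true
      storage: true
```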

I moved on (I believe it is fixed by #322) and tried to create a PVC.

What did you expect to happen?

I was expecting the PV to be created.

Configuration (MUST fill this out):

  • system logs:

https://gist.github.com/max3903/acb18527be1138a33d77f3eaaddb89a8

  • manifests, such as pvc, deployments, etc.. you used to reproduce:

secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: digitalocean
  namespace: kube-system
stringData:
  access-token: "3e8[...]ec5"

pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: do-block-storage

  • CSI Version:

1.3.0

  • Kubernetes Version:

1.17

  • Cloud provider/framework version, if applicable (such as Rancher):

OKD 4.5

Other information:

I am using OKD 4.5 on Fedora CoreOS 31.

The pod csi-do-controller-0 remains in status CrashLoopBackOff.

4 out of 5 containers are in state Running but have this error message in the log:

connection.go:170] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

The last one, csi-do-plugin (digitalocean/do-csi-plugin:v1.3.0), remains in state Waiting and the log says:

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)

On the worker, the csi.sock is not in:

/var/lib/csi/sockets/pluginproxy/csi.sock

but in

/var/lib/kubelet/plugins/dobs.csi.digitalocean.com/csi.sock
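For context, the pluginproxy path exists only inside the controller pod: the sidecars and the plugin share the socket through an emptyDir volume, so it never appears at that path on the host. A representative fragment of such a controller manifest (paths taken from the logs above; the actual manifest may differ):

```yaml
# The sidecars and csi-do-plugin share csi.sock via an emptyDir, so
# /var/lib/csi/sockets/pluginproxy/ is pod-local, not a host path.
volumes:
  - name: socket-dir
    emptyDir: {}
containers:
  - name: csi-do-plugin
    args:
      - --endpoint=unix:///var/lib/csi/sockets/pluginproxy/csi.sock
    volumeMounts:
      - name: socket-dir
        mountPath: /var/lib/csi/sockets/pluginproxy/
```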

Hi @max3903

the error

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)

is odd because it usually means that you are not running on DigitalOcean infrastructure (as the error indicates). However, I do see a DO region label on one of your Nodes. Can you confirm that you are indeed running on droplets? Can you connect to the metadata endpoint from your nodes?
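One way to run that check from a node (a minimal sketch; assumes curl is available, and treats any response other than actual metadata as unreachable):

```shell
# Probe the DigitalOcean metadata service the CSI plugin queries on startup.
probe_metadata() {
  # --fail maps HTTP errors to a non-zero exit; --max-time bounds the hang
  # when the link-local address is not routable from this namespace.
  if curl --fail --silent --max-time 5 \
      http://169.254.169.254/metadata/v1.json > /dev/null 2>&1; then
    echo reachable
  else
    echo unreachable
  fi
}
probe_metadata
```

Running it inside a pod rather than on the host exercises the pod's network namespace, which is exactly the variable at play in this issue.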

What might also be good to know: did you try to apply the manifests on a cluster that had a previous version of the CSI driver installed already, or was this a first-time CSI installation attempt?

Hello @timoreimann

Yes I am running on droplets built from a custom image: Fedora CoreOS 31 for Digital Ocean from https://getfedora.org/en/coreos/download?tab=cloud_operators&stream=stable

Yes, I can connect to the metadata endpoint from the 3 masters and 2 workers. That is actually how each droplet gets its hostname during the installation:
See coreos/fedora-coreos-tracker#538

Yes, I tried to apply the manifests multiple times using different versions/URLs: 0.3.0 first, then latest, and finally 1.3.0.

So I ran:

oc delete -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v0.3.0.yaml

oc apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

I don't know if it helps but the container created from the DaemonSet is working fine on the same node.

Only the one created from the StatefulSet is crashing...

@max3903 The CSI driver in version 0.3.0 definitely does not support Kubernetes 1.17. (See also our support matrix.) If you installed that first, the subsequent 1.3.0 installation most likely failed because of unsupported (and broken) leftovers from 0.3.0.

Can you try to install v1.3.0 from a clean slate, i.e., on a 1.17 cluster that does not come with any other (older) CSI driver versions installed beforehand?

Even after running:

oc delete -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v0.3.0.yaml

?

@timoreimann Installing the cluster was a pretty painful process that I would like to avoid repeating.

I removed all the csi* images from all the masters and workers:

podman image rm docker.io/digitalocean/do-csi-plugin:v1.3.0
podman image rm docker.io/digitalocean/do-csi-plugin:dev
podman image rm quay.io/k8scsi/csi-node-driver-registrar:v1.1.0
podman image rm quay.io/k8scsi/csi-resizer:v0.3.0
podman image rm quay.io/k8scsi/csi-snapshotter:v1.2.2
podman image rm quay.io/k8scsi/csi-provisioner:v1.4.0
podman image rm quay.io/k8scsi/csi-attacher:v2.0.0

and installed the correct version (1.3.0). I still get the same error.

Which leftovers am I missing?

Check for any snapshot-related CRDs that might be remaining (kubectl get crd) and delete them.
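A sketch of that cleanup; the filter is demonstrated against sample CRD names (standing in for `kubectl get crd -o name`) so it can be checked without a cluster, and the live-cluster form is in the trailing comment:

```shell
# Select snapshot-related CRDs left behind by an older driver install.
list_snapshot_crds() {
  grep 'snapshot\.storage\.k8s\.io'
}

# Sample input standing in for `kubectl get crd -o name`:
printf '%s\n' \
  'customresourcedefinition.apiextensions.k8s.io/volumesnapshots.snapshot.storage.k8s.io' \
  'customresourcedefinition.apiextensions.k8s.io/volumesnapshotcontents.snapshot.storage.k8s.io' \
  'customresourcedefinition.apiextensions.k8s.io/volumes.dobs.csi.digitalocean.com' |
  list_snapshot_crds

# Against a real cluster:
#   kubectl get crd -o name | grep 'snapshot\.storage\.k8s\.io' | xargs -r kubectl delete
```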

@timoreimann

I deleted them.

No errors when running:

kubectl apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

Still the same behavior on the controller, i.e., the pod csi-do-controller-0 remains in status CrashLoopBackOff.

4 out of 5 containers are in state Running but have this error message in the log:

connection.go:170] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

The last one, csi-do-plugin (digitalocean/do-csi-plugin:v1.3.0), remains in state Waiting and the log says:

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)

If I replace the args at https://github.com/digitalocean/csi-digitalocean/blob/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L194 with:

          args:
            - "--version"

I get this message in the logs of the container:

latest - 59e354368961c4688243fc083c94b963c276e5b4 (clean)

I tried to run the container on the worker:

$ podman run digitalocean/do-csi-plugin:v1.3.0 \
    --endpoint=unix:///var/lib/csi/sockets/pluginproxy/csi.sock \
    --url=https://api.digitalocean.com/ \
    --token=3e8****ec5
time="2020-06-18T00:05:44Z" level=info msg="removing socket" host_id=196466821 region=sfo3 socket=/var/lib/csi/sockets/pluginproxy/csi.sock version=latest
2020/06/18 00:05:44 failed to listen: listen unix /var/lib/csi/sockets/pluginproxy/csi.sock: bind: no such file or directory

I also tried to use curl to create a volume through the API from the same node and it worked:

curl -X POST -H "Content-Type: application/json" \
    -H "Authorization: Bearer 3e8***ec5" \
    -d '{"size_gigabytes":10, "name": "example", "description": "Block store for examples", "region": "sfo3", "filesystem_type": "ext4", "filesystem_label": "example"}' \
    "https://api.digitalocean.com/v2/volumes"

The container from the same image on the same node from the DaemonSet is still working fine:

time="2020-06-17T23:00:21Z" level=info msg="removing socket" host_id=196466821 region=sfo3 socket=/csi/csi.sock version=latest
time="2020-06-17T23:00:21Z" level=info msg="starting server" grpc_addr=/csi/csi.sock host_id=196466821 http_addr= region=sfo3 version=latest
time="2020-06-17T23:00:22Z" level=info msg="get plugin info called" host_id=196466821 method=get_plugin_info region=sfo3 response="name:\"dobs.csi.digitalocean.com\" vendor_version:\"latest\" " version=latest
time="2020-06-17T23:00:23Z" level=info msg="node get info called" host_id=196466821 method=node_get_info region=sfo3 version=latest

FYI, all droplets are Fedora CoreOS 31 in SFO3 with this workaround to set the hostname:
coreos/fedora-coreos-tracker#538

lucab commented

The couldn't get metadata error is likely a red herring caused by the manual podman run, which sets up the network namespace differently than the k8s manifest does.

@max3903 glad you figured it out. 🎉
Do I understand correctly that you needed to add the hostNetwork / privileged fields to the Controller service? (We do have it set on the Node service in the manifest.)

FWIW, the manifest you referenced (and had to amend) is what we use for our end-to-end tests as-is: we deploy it into a DOKS cluster and run upstream e2e tests against. I'm confused why it didn't work for you -- wondering if there's perhaps something specific about OKD (or DOKS) that explains the difference in behavior?

@timoreimann Yes on the controller.

@dustymabe mentioned that openshift has stricter security settings than base kubernetes.

> @dustymabe mentioned that openshift has stricter security settings than base kubernetes.

Typically that is the case. Unfortunately I don't have enough expertise to know what those extra security defaults are or if that's the cause of the issues here. I just know enough to bring up that it could be the cause.

@timoreimann With @lucab's and @dustymabe's help, I got it working by adding:

      hostNetwork: true
      securityContext:
        privileged: true

in https://github.com/digitalocean/csi-digitalocean/blob/release-1.3/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L142
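For anyone applying the same fix by hand, the field goes at the pod template's spec level of the controller StatefulSet. An illustrative fragment, trimmed to the relevant fields (the follow-up comments note that hostNetwork alone is sufficient):

```yaml
# Controller StatefulSet pod template with the OKD workaround:
# hostNetwork lets the plugin reach the link-local metadata address.
spec:
  template:
    spec:
      hostNetwork: true
      containers:
        - name: csi-do-plugin
          image: digitalocean/do-csi-plugin:v1.3.0
```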

This seems to be working for me with just the hostNetwork: true change. I don't think privileged: true is needed.

Right, privileged mode should be needed on the Node service only to allow mount propagation. I don't think we have it set on our Controller service manifest.

If you'd like to submit a quick PR to document the need to run on host network in OKD (and perhaps leave a commented out hostNetwork: true field in the manifest), I'd be happy to review that.

Thanks @timoreimann. Do you think it would make sense to do it by default instead of having it commented out?

@dustymabe the only platform I'm aware of at this point that requires host networking to be enabled on the Controller service seems to be OKD. So I'm more inclined to keep it commented out for now.
If someone could manage to find out more specific reasons why it's needed in OKD though, we could possibly better judge if it's something that other platforms / systems may be affected by as well.

I changed the csi-do-plugin container within the pod to just sleep so I could exec in there and poke around.

/ # ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
3: eth0    inet 10.129.0.61/23 brd 10.129.1.255 scope global eth0\       valid_lft forever preferred_lft forever
/ # busybox wget http://169.254.169.254/metadata/v1.json
Connecting to 169.254.169.254 (169.254.169.254:80)
wget: can't connect to remote host (169.254.169.254): Connection refused

It might be worth noting that OKD uses OVN networking: https://docs.openshift.com/container-platform/4.5/networking/ovn_kubernetes_network_provider/about-ovn-kubernetes.html. Unfortunately I don't know much about the networking side so I'm a bit limited in understanding this.

As a temporary workaround, this patch command should work for users:

PATCH='
spec:
  template:
    spec:
      hostNetwork: true'
oc patch statefulset/csi-do-controller -n kube-system --type merge -p "$PATCH"

Can we change the title of this to csi-do-controller-0 CrashLoopBackOff: couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json" so others can find it more easily?

👋 So I've run into this issue as well using K3s on DO. I was able to finally get things running with hostNetwork: true. I'm using the default network driver of flannel, but it does use containerd as the runtime.

I can confirm that the workaround in #328 (comment) still works for me today.