ceph/ceph-helm

The DNS pod cannot resolve the ceph monitor's name ceph-mon.ceph.svc.cluster.local

zhangdaolong opened this issue · 6 comments

Is this a request for help?: yes


Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

The kubedns container in pod kube-dns-85bc874cc5-mdzhb always logs:

1 dns.go:555] Could not find endpoints for service "ceph-mon" in namespace "ceph". DNS records will be created once endpoints show up.

[root@master ceph]# helm install --name=ceph local/ceph --namespace=ceph
NAME: ceph
LAST DEPLOYED: Tue Jun 12 09:53:41 2018
NAMESPACE: ceph
STATUS: DEPLOYED

RESOURCES:
==> v1beta1/DaemonSet
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ceph-mon 1 1 0 1 0 ceph-mon=enabled 1s
ceph-osd-dev-sda 1 1 0 1 0 ceph-osd-device-dev-sda=enabled,ceph-osd=enabled 1s

==> v1beta1/Deployment
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
ceph-mds 1 1 1 0 1s
ceph-mgr 1 1 1 0 1s
ceph-mon-check 1 1 1 0 1s
ceph-rbd-provisioner 2 2 2 0 1s
ceph-rgw 1 1 1 0 1s

==> v1/Job
NAME DESIRED SUCCESSFUL AGE
ceph-mon-keyring-generator 1 0 1s
ceph-mds-keyring-generator 1 0 1s
ceph-osd-keyring-generator 1 0 1s
ceph-mgr-keyring-generator 1 0 1s
ceph-rgw-keyring-generator 1 0 1s
ceph-namespace-client-key-generator 1 0 1s
ceph-storage-keys-generator 1 0 1s

==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
ceph-mon-rsjkn 0/3 Init:0/2 0 1s
ceph-osd-dev-sda-jb8s7 0/1 Init:0/3 0 1s
ceph-mds-696bd98bdb-92tj2 0/1 Init:0/2 0 1s
ceph-mgr-56f45bb99c-pmpfm 0/1 Pending 0 1s
ceph-mon-check-74d98c5b95-k5xc5 0/1 Pending 0 1s
ceph-rbd-provisioner-b58659dc9-llllj 0/1 Pending 0 1s
ceph-rbd-provisioner-b58659dc9-rh4zd 0/1 ContainerCreating 0 1s
ceph-rgw-5bd9dd66c5-q5vzp 0/1 Pending 0 1s
ceph-mon-keyring-generator-nzg2l 0/1 Pending 0 1s
ceph-mds-keyring-generator-cr8ql 0/1 Pending 0 1s
ceph-osd-keyring-generator-z5jrq 0/1 Pending 0 1s
ceph-mgr-keyring-generator-kw2wj 0/1 Pending 0 1s
ceph-rgw-keyring-generator-6kghm 0/1 Pending 0 1s
ceph-namespace-client-key-generator-dk968 0/1 Pending 0 1s
ceph-storage-keys-generator-4mhhk 0/1 Pending 0 1s

==> v1/Secret
NAME TYPE DATA AGE
ceph-keystone-user-rgw Opaque 7 1s

==> v1/ConfigMap
NAME DATA AGE
ceph-bin-clients 2 1s
ceph-bin 26 1s
ceph-etc 1 1s
ceph-templates 5 1s

==> v1/StorageClass
NAME PROVISIONER AGE
ceph-rbd ceph.com/rbd 1s

==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ceph-mon ClusterIP None 6789/TCP 1s
ceph-rgw ClusterIP 10.109.46.173 8088/TCP 1s

[root@master ceph]# kubectl exec kube-dns-85bc874cc5-mdzhb -ti -n kube-system -c kubedns -- sh
/ # ps
PID USER TIME COMMAND
1 root 3:19 /kube-dns --domain=172.16.34.88. --dns-port=10053 --config-dir=/kube-dns-config --v=2
26 root 0:35 ping ceph-mon.ceph.svc.cluster.local
157 root 0:00 sh
161 root 0:00 sh
165 root 0:00 sh
/ #
/ # ping ceph-mon.ceph.svc.cluster.local
ping: bad address 'ceph-mon.ceph.svc.cluster.local'
/ #
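A ping inside the kubedns container only exercises that container's own resolver configuration, so a more conclusive test is to query the DNS server in the same pod directly (a sketch, assuming the pod's dnsmasq sidecar is listening on 127.0.0.1:53 of the shared pod network namespace):

/ # nslookup ceph-mon.ceph.svc.cluster.local 127.0.0.1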

[root@master ceph]# kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
ceph ceph-mds-696bd98bdb-rvq42 0/1 CrashLoopBackOff 6 11m
ceph ceph-mds-keyring-generator-6nrct 0/1 Completed 0 11m
ceph ceph-mgr-56f45bb99c-smqqj 0/1 CrashLoopBackOff 6 11m
ceph ceph-mgr-keyring-generator-kdjd4 0/1 Completed 0 11m
ceph ceph-mon-check-74d98c5b95-nqhmg 1/1 Running 0 11m
ceph ceph-mon-keyring-generator-7xmd8 0/1 Completed 0 11m
ceph ceph-mon-m72hp 3/3 Running 0 11m
ceph ceph-namespace-client-key-generator-cvnpw 0/1 Completed 0 11m
ceph ceph-osd-dev-sda-kzn65 0/1 Init:CrashLoopBackOff 6 11m
ceph ceph-osd-keyring-generator-48gb6 0/1 Completed 0 11m
ceph ceph-rbd-provisioner-b58659dc9-7jsnk 1/1 Running 0 11m
ceph ceph-rbd-provisioner-b58659dc9-sf6hr 1/1 Running 0 11m
ceph ceph-rgw-5bd9dd66c5-n25bn 0/1 CrashLoopBackOff 6 11m
ceph ceph-rgw-keyring-generator-vs8th 0/1 Completed 0 11m
ceph ceph-storage-keys-generator-ww7hn 0/1 Completed 0 11m
default busybox 1/1 Running 113 4d
kube-system etcd-master 1/1 Running 8 24d
kube-system heapster-69b5d4974d-9g96p 1/1 Running 10 24d
kube-system kube-apiserver-master 1/1 Running 8 24d
kube-system kube-controller-manager-master 1/1 Running 8 24d
kube-system kube-dns-85bc874cc5-mdzhb 3/3 Running 27 24d
kube-system kube-flannel-ds-b94c4 1/1 Running 12 24d
kube-system kube-flannel-ds-sqzwv 1/1 Running 10 24d
kube-system kube-proxy-9j6sq 1/1 Running 10 24d
kube-system kube-proxy-znkxj 1/1 Running 7 24d
kube-system kube-scheduler-master 1/1 Running 8 24d
kube-system kubernetes-dashboard-7d5dcdb6d9-c2sz6 1/1 Running 10 24d
kube-system monitoring-grafana-69df66f668-fpgn5 1/1 Running 10 24d
kube-system monitoring-influxdb-78d4c6f5b6-hnjg2 1/1 Running 50 24d
kube-system tiller-deploy-f9b8476d-trtml 1/1 Running 0 4d

Version of Helm and Kubernetes:

[root@master ceph]# helm version
Client: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.9.1", GitCommit:"20adb27c7c5868466912eebdf6664e7390ebe710", GitTreeState:"clean"}

[root@master ceph]# kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Which chart: local/ceph

What happened:
The DNS pod cannot resolve ceph-mon.ceph.svc.cluster.local.

What you expected to happen:
The DNS pod can resolve ceph-mon.ceph.svc.cluster.local.

How to reproduce it (as minimally and precisely as possible):
Install the chart as above; the issue reproduces every time.

Anything else we need to know:
None

What do you see in the ceph-mon pod logs? kubectl logs -n ceph ceph-mon-xxxx

Can you check why the ceph-mon service has no cluster IP?

==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ceph-mon ClusterIP None 6789/TCP 1s

Check the /etc/resolv.conf of the pod that couldn't resolve ceph-mon and make sure a nameserver x.x.x.x entry exists, where x.x.x.x is the IP of kube-dns. You can check that IP with "kubectl get svc -n kube-system"; make sure it matches. Also make sure the kube-dns pod is running.
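For example (ceph-mon-m72hp is the one fully running ceph pod in the listing above; adjust names to your cluster, and k8s-app=kube-dns is the label the stock kube-dns deployment uses):

[root@master ceph]# kubectl -n ceph exec ceph-mon-m72hp -c ceph-mon -- cat /etc/resolv.conf
[root@master ceph]# kubectl -n kube-system get svc kube-dns
[root@master ceph]# kubectl -n kube-system get pods -l k8s-app=kube-dns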

I seem to have the same or a similar problem. I can't resolve the service name from any of the pods, for example from inside ceph-mon:

~# kubectl exec -n ceph -ti ceph-mon-cqwzq -c ceph-mon -- ceph -s
server name not found: ceph-mon.ceph.svc.cluster.local (Temporary failure in name resolution)
unable to parse addrs in 'ceph-mon.ceph.svc.cluster.local'
InvalidArgumentError does not take keyword arguments
command terminated with exit code 1
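To narrow this down to name resolution rather than Ceph itself, the same name can be looked up directly inside the container (assuming nslookup is available in the ceph-mon image):

~# kubectl exec -n ceph -ti ceph-mon-cqwzq -c ceph-mon -- nslookup ceph-mon.ceph.svc.cluster.local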

I guess this is the reason why my OSDs are failing during init:

kubectl -n ceph logs ceph-osd-dev-sda-xtrm6 -c osd-prepare-pod
+ export LC_ALL=C
+ LC_ALL=C
+ source variables_entrypoint.sh
++ ALL_SCENARIOS='osd osd_directory osd_directory_single osd_ceph_disk osd_ceph_disk_prepare osd_ceph_disk_activate osd_ceph_activate_journal mgr'
++ : ceph
++ : ceph-config/ceph
++ :
++ : osd_ceph_disk_prepare
++ : 1
++ : hive-02
++ : hive-02
++ : /etc/ceph/monmap-ceph
++ : /var/lib/ceph/mon/ceph-hive-02
++ : 0
++ : 0
++ : mds-hive-02
++ : 0
++ : 100
++ : 0
++ : 0
+++ uuidgen
++ : eaddd16b-3a95-4f4c-ba8a-161be9306f42
+++ uuidgen
++ : 5c1c63f8-caa9-4c79-8158-df86aa87df4b
++ : root=default host=hive-02
++ : 0
++ : cephfs
++ : cephfs_data
++ : 8
++ : cephfs_metadata
++ : 8
++ : hive-02
++ :
++ :
++ : 8080
++ : 0
++ : 9000
++ : 0.0.0.0
++ : cephnfs
++ : hive-02
++ : 0.0.0.0
++ CLI_OPTS='--cluster ceph'
++ DAEMON_OPTS='--cluster ceph --setuser ceph --setgroup ceph -d'
++ MOUNT_OPTS='-t xfs -o noatime,inode64'
++ MDS_KEYRING=/var/lib/ceph/mds/ceph-mds-hive-02/keyring
++ ADMIN_KEYRING=/etc/ceph/ceph.client.admin.keyring
++ MON_KEYRING=/etc/ceph/ceph.mon.keyring
++ RGW_KEYRING=/var/lib/ceph/radosgw/hive-02/keyring
++ MGR_KEYRING=/var/lib/ceph/mgr/ceph-hive-02/keyring
++ MDS_BOOTSTRAP_KEYRING=/var/lib/ceph/bootstrap-mds/ceph.keyring
++ RGW_BOOTSTRAP_KEYRING=/var/lib/ceph/bootstrap-rgw/ceph.keyring
++ OSD_BOOTSTRAP_KEYRING=/var/lib/ceph/bootstrap-osd/ceph.keyring
++ OSD_PATH_BASE=/var/lib/ceph/osd/ceph
+ source common_functions.sh
++ set -ex
+ is_available rpm
+ command -v rpm
+ is_available dpkg
+ command -v dpkg
+ OS_VENDOR=ubuntu
+ source /etc/default/ceph
++ TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
+ case "$CEPH_DAEMON" in
+ OSD_TYPE=prepare
+ start_osd
+ [[ ! -e /etc/ceph/ceph.conf ]]
+ '[' 1 -eq 1 ']'
+ [[ ! -e /etc/ceph/ceph.client.admin.keyring ]]
+ case "$OSD_TYPE" in
+ source osd_disk_prepare.sh
++ set -ex
+ osd_disk_prepare
+ [[ -z /dev/sda ]]
+ [[ ! -e /dev/sda ]]
+ '[' '!' -e /var/lib/ceph/bootstrap-osd/ceph.keyring ']'
+ timeout 10 ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring health
+ exit 1
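The step that actually fails is the bootstrap health check right before the exit 1. One way to confirm that only DNS, and not the keyring, is at fault is to point the same kind of check straight at a monitor address with -m, bypassing name resolution (a sketch; <mon-endpoint-ip> stands for one of the endpoint addresses shown by kubectl -n ceph describe service ceph-mon):

~# kubectl -n ceph exec -ti ceph-mon-cqwzq -c ceph-mon -- ceph --cluster ceph -m <mon-endpoint-ip>:6789 -s

If that returns cluster status while the name-based call fails, the OSD init failure is just a downstream symptom of the DNS problem.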

@feresberbeche Here's the resolv.conf of my ceph-mon pod. The nameserver matches the kube-dns IP:

nameserver 10.96.0.10
search ceph.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
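Since the nameserver matches, the next thing worth checking is whether kube-dns itself answers for the record when queried directly (assuming nslookup is available wherever you run this, e.g. on a node):

~# nslookup ceph-mon.ceph.svc.cluster.local 10.96.0.10

If that times out or returns NXDOMAIN, the problem is in kube-dns or the network path to it, rather than in the pods' resolver configuration.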

@rootfs The ceph-mon service also has no cluster IP in my case. How can I check why this is the case?
Here is the description of the service:

~# kubectl describe service ceph-mon -n ceph
Name:              ceph-mon
Namespace:         ceph
Labels:            <none>
Annotations:       service.alpha.kubernetes.io/tolerate-unready-endpoints=true
Selector:          application=ceph,component=mon,release_group=ceph
Type:              ClusterIP
IP:                None
Port:              <unset>  6789/TCP
TargetPort:        6789/TCP
Endpoints:         <redacted public ip>:6789,<redacted public ip>:6789,<redacted public ip>:6789
Session Affinity:  None
Events:            <none>

The log of the ceph-mon pod looks fine to me. I uploaded it here: https://gist.github.com/Silberschleier/1baad5d4853c48abeff3b1326b5cc7db

Hyvi commented

The same problem happened here. Everything looks OK, but I can't ping the DNS IP or other services.

Hackish solution: on the nodes running & mounting Ceph, add ceph-mon-discovery.ceph.svc.cluster.local to /etc/hosts, e.g.:

kubectl describe service ceph-mon -n ceph
Name:              ceph-mon
Namespace:         ceph
Labels:            <none>
Annotations:       service.alpha.kubernetes.io/tolerate-unready-endpoints: true
Selector:          application=ceph,component=mon,release_group=ceph
Type:              ClusterIP
IP:                None
Port:              <unset>  6789/TCP
TargetPort:        6789/TCP
Endpoints:         172.20.1.60:6789
Session Affinity:  None
Events:            <none>
echo '172.20.1.60	ceph-mon.ceph.svc.cluster.local' >> /etc/hosts

A better way would be to add kube-dns to the nodes' name resolution (see kubectl -n kube-system get svc/kube-dns).
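A minimal sketch of that, assuming kubectl -n kube-system get svc/kube-dns reports 10.96.0.10 as the CLUSTER-IP and that nothing else manages the node's /etc/resolv.conf:

echo 'nameserver 10.96.0.10' >> /etc/resolv.conf

Note that the resolver tries nameservers in listed order and won't fall through to a later entry for names the first server answers with NXDOMAIN, so the kube-dns entry may need to come first. resolv.conf is also frequently rewritten by DHCP clients and network managers, which makes this about as fragile as the hosts entry.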

spuny commented

Hi,

yours "hacky solution" saved my life for the moment.
After some digging I discovered that there is NO IP on service:

kind: Service
apiVersion: v1
metadata:
  name: {{ tuple "ceph_mon" "internal" . | include "helm-toolkit.endpoints.hostname_short_endpoint_lookup" }}
spec:
  ports:
  - port: {{ tuple "ceph_mon" "internal" "mon" $envAll | include "helm-toolkit.endpoints.endpoint_port_lookup" }}
    protocol: TCP
    targetPort: {{ tuple "ceph_mon" "internal" "mon" $envAll | include "helm-toolkit.endpoints.endpoint_port_lookup" }}
  selector:
{{ tuple $envAll "ceph" "mon" | include "helm-toolkit.snippets.kubernetes_metadata_labels" | indent 4 }}
  clusterIP: None
{{- end }}

So another "hacky" solution is to delete line:
clusterIP: None

Than, you can ping/nslookup it with name: ceph-mon.ceph.svc.clusterl.local. - This is tested on lab.
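If you go that route, re-deploying should give the Service a virtual IP, which you can confirm with kubectl (the output shape below is illustrative; the actual IP will differ):

~# kubectl -n ceph get svc ceph-mon
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
ceph-mon   ClusterIP   10.x.x.x     <none>        6789/TCP   1m

Keep in mind the chart makes this Service headless on purpose, so that the DNS name resolves to the mon pods' own addresses; with a ClusterIP, clients instead get a single virtual IP that kube-proxy load-balances, which Ceph clients may or may not cope with.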

UPDATE
This can be marked as solved, I suppose; I just happened to solve it.
Use another network plugin: instead of Weave I used Calico, together with a modification of the nodes' resolv.conf:

nameserver 10.233.0.3
nameserver 8.8.8.8

And it magically started working.