openshift/openshift-ansible

OKD 3.11 - Multiple services pointing to the same ready pod take ages to be accessible

ThinkBriK opened this issue · 5 comments

Description

On a multi-master install, creating a simple pod with multiple services pointing to it causes the services to become reachable long after the pod reports ready.

Version

OpenShift 3.11

Steps To Reproduce

Deploy the following resources (apply/watch commands are sketched after the manifests):

# Dummy Server
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: test-http
spec:
  replicas: 1
  selector:
    matchLabels:
      run: test-http
      app: test
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: test-http
        app: test
    spec:
      containers:
      - args:
        - httpd
        - -fv
        image: busybox
        imagePullPolicy: Always
        name: test-http
        readinessProbe:
          exec:
            command:
            - /bin/echo
            - toto
          failureThreshold: 3
          # delayed readiness
          initialDelaySeconds: 15
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
      restartPolicy: Always
      terminationGracePeriodSeconds: 0

---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    run: test-http
    app: test
  name: test
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: test
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    run: test-http
    app: test
  name: test2
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: test
  type: ClusterIP
---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    run: test-http
    app: test
  name: test3
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: test
  type: ClusterIP
---
# Tester pod
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    run: test-http
    app: tester
  name: tester
spec:
  replicas: 1
  selector:
    matchLabels:
      run: test-http
      app: tester
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: test-http
        app: tester
    spec:
      containers:
      - args:
        - /bin/sh
        - -c
        - while true; do wget -T 5 test; sleep 1; wget -T 5 test2; sleep 1; wget -T 5 test3; sleep 5; done
        image: busybox
        imagePullPolicy: Always
        name: tester
      restartPolicy: Always
      terminationGracePeriodSeconds: 0
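
To reproduce, save the manifests above (reproducer.yaml is just a placeholder name) and watch the endpoints and the tester output once the pod reports Ready; a minimal sketch, assuming the project is called mynamespace as in the logs further down:

$ oc apply -f reproducer.yaml -n mynamespace
$ oc get endpoints -n mynamespace -w
$ oc logs -f -n mynamespace $(oc get pod -n mynamespace -l app=tester -o name)

All three services (test, test2, test3) should list the pod IP in their endpoints shortly after the readiness probe passes.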

Expected Results

The tester logs should be full of HTTP 404 responses (busybox httpd serving an empty document root), and test-http should record the requests:

~ oc logs tester-6d5b5f5b45-bwpf8 
Connecting to test (172.30.5.0:80)
wget: server returned error: HTTP/1.0 404 Not Found
Connecting to test2 (172.30.166.168:80)
wget: server returned error: HTTP/1.0 404 Not Found
Connecting to test3 (172.30.160.79:80)
wget: server returned error: HTTP/1.0 404 Not Found
Observed Results

I get "No route to host" for all but one of the services at first; the remaining services only become reachable gradually:

Connecting to test (172.30.5.0:80)
wget: can't connect to remote host (172.30.5.0): No route to host
Connecting to test2 (172.30.166.168:80)
wget: server returned error: HTTP/1.0 404 Not Found
Connecting to test3 (172.30.160.79:80)
wget: can't connect to remote host (172.30.160.79): No route to host
Connecting to test (172.30.5.0:80)
wget: can't connect to remote host (172.30.5.0): No route to host
Connecting to test2 (172.30.166.168:80)
wget: server returned error: HTTP/1.0 404 Not Found
Connecting to test3 (172.30.160.79:80)
...
Additional Information

  • Your operating system and version: CentOS Linux release 7.7.1908 (Core)

It seems that iptables rules are created for only a single service:

# iptables -t nat -L | grep 'mynamespace/test'
KUBE-MARK-MASQ  all  --  ip-10-130-140-7.eu-west-3.compute.internal  anywhere             /* mynamespace/test2:http */
DNAT       tcp  --  anywhere             anywhere             /* mynamespace/test2:http */ tcp to:10.130.140.7:80
KUBE-MARK-MASQ  tcp  -- !ip-10-128-0-0.eu-west-3.compute.internal/14  test2.mynamespace.svc.cluster.local  /* mynamespace/test2:http cluster IP */ tcp dpt:http
KUBE-SVC-4EDXJG65XWML2DSW  tcp  --  anywhere             test2.mynamespace.svc.cluster.local  /* mynamespace/test2:http cluster IP */ tcp dpt:http
KUBE-SEP-MPCOJF5FIFQHGSLF  all  --  anywhere             anywhere             /* mynamespace/test2:http */
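
For comparison, counting the rules rendered per service on the node makes the asymmetry easy to spot (a quick sketch, assuming the same namespace; a service whose endpoint has been programmed shows KUBE-SVC/KUBE-SEP chains, the others do not):

# iptables-save | grep -c 'mynamespace/test:'
# iptables-save | grep -c 'mynamespace/test2:'
# iptables-save | grep -c 'mynamespace/test3:'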

Update:
Found out that the endpoints never get programmed for the non-working services!
Log of a working service:

{"log":"I1020 08:57:22.785503   84796 proxy.go:261] hybrid proxy: (always) update ep mynamespace/test3 in unidling proxy\n","stream":"stderr","time":"2020-10-20T08:57:22.78589115Z"}
{"log":"I1020 08:57:22.785508   84796 roundrobin.go:310] LoadBalancerRR: Setting endpoints for mynamespace/test3:http to [10.130.140.27:80]\n","stream":"stderr","time":"2020-10-20T08:57:22.785894097Z"}
{"log":"I1020 08:57:22.785513   84796 roundrobin.go:240] Delete endpoint 10.130.140.27:80 for service \"mynamespace/test3:http\"\n","stream":"stderr","time":"2020-10-20T08:57:22.785896856Z"}
{"log":"I1020 08:57:22.785519   84796 proxy.go:283] hybrid proxy: update ep mynamespace/test3 in main proxy\n","stream":"stderr","time":"2020-10-20T08:57:22.785899713Z"}
{"log":"I1020 08:57:22.785525   84796 endpoints.go:234] Setting endpoints for \"mynamespace/test3:http\" to []\n","stream":"stderr","time":"2020-10-20T08:57:22.785902388Z"}
{"log":"I1020 08:57:22.785535   84796 endpoints.go:234] Setting endpoints for \"mynamespace/test3:http\" to [10.130.140.27:80]\n","stream":"stderr","time":"2020-10-20T08:57:22.785905229Z"}
{"log":"-A KUBE-SERVICES -m comment --comment \"mynamespace/test3:http cluster IP\" -m tcp -p tcp -d 172.30.38.204/32 --dport 80 ! -s 10.128.0.0/14 -j KUBE-MARK-MASQ\n","stream":"stderr","time":"2020-10-20T08:57:22.81362588Z"}
{"log":"-A KUBE-SERVICES -m comment --comment \"mynamespace/test3:http cluster IP\" -m tcp -p tcp -d 172.30.38.204/32 --dport 80 -j KUBE-SVC-IKGWW46TKWZ3CTQH\n","stream":"stderr","time":"2020-10-20T08:57:22.813628775Z"}
{"log":"-A KUBE-SVC-IKGWW46TKWZ3CTQH -m comment --comment mynamespace/test3:http -j KUBE-SEP-Z35DPKKXT4TOKDBN\n","stream":"stderr","time":"2020-10-20T08:57:22.81363166Z"}
{"log":"-A KUBE-SEP-Z35DPKKXT4TOKDBN -m comment --comment mynamespace/test3:http -s 10.130.140.27/32 -j KUBE-MARK-MASQ\n","stream":"stderr","time":"2020-10-20T08:57:22.81363616Z"}
{"log":"-A KUBE-SEP-Z35DPKKXT4TOKDBN -m comment --comment mynamespace/test3:http -m tcp -p tcp -j DNAT --to-destination 10.130.140.27:80\n","stream":"stderr","time":"2020-10-20T08:57:22.813638985Z"}
{"log":"I1020 08:57:22.821700   84796 healthcheck.go:235] Not saving endpoints for unknown healthcheck \"mynamespace/test3\"\n","stream":"stderr","time":"2020-10-20T08:57:22.821765899Z"}
{"log":"I1020 08:57:29.425753   84796 serviceresolver.go:161] Answered test3.mynamespace.svc.cluster.local.:false with msg.Service{Host:\"172.30.38.204\", Port:0, Priority:10, Weight:10, Text:\"\", Mail:false, Ttl:0x1e, TargetStrip:0, Group:\"\", Key:\"/skydns/local/cluster/svc/mynamespace/test3/ec1415c3\"}\n","stream":"stderr","time":"2020-10-20T08:57:29.425804747Z"}
{"log":"I1020 08:57:29.426074   84796 serviceresolver.go:161] Answered test3.mynamespace.svc.cluster.local.:false with msg.Service{Host:\"172.30.38.204\", Port:0, Priority:10, Weight:10, Text:\"\", Mail:false, Ttl:0x1e, TargetStrip:0, Group:\"\", Key:\"/skydns/local/cluster/svc/mynamespace/test3/ec1415c3\"}\n","stream":"stderr","time":"2020-10-20T08:57:29.42610662Z"}

Log of a non-working service (the proxy still logs setting the endpoints for mynamespace/test2:http to the pod IP, yet only the "has no endpoints" REJECT rule is rendered into iptables):

{"log":"I1020 08:57:22.787821   84796 proxy.go:261] hybrid proxy: (always) update ep mynamespace/test2 in unidling proxy\n","stream":"stderr","time":"2020-10-20T08:57:22.78786746Z"}
{"log":"I1020 08:57:22.787834   84796 roundrobin.go:310] LoadBalancerRR: Setting endpoints for mynamespace/test2:http to [10.130.140.27:80]\n","stream":"stderr","time":"2020-10-20T08:57:22.78787164Z"}
{"log":"I1020 08:57:22.787846   84796 roundrobin.go:240] Delete endpoint 10.130.140.27:80 for service \"mynamespace/test2:http\"\n","stream":"stderr","time":"2020-10-20T08:57:22.787883266Z"}
{"log":"I1020 08:57:22.787866   84796 proxy.go:283] hybrid proxy: update ep mynamespace/test2 in main proxy\n","stream":"stderr","time":"2020-10-20T08:57:22.787889936Z"}
{"log":"I1020 08:57:22.787877   84796 endpoints.go:234] Setting endpoints for \"mynamespace/test2:http\" to []\n","stream":"stderr","time":"2020-10-20T08:57:22.78789457Z"}
{"log":"I1020 08:57:22.787893   84796 endpoints.go:234] Setting endpoints for \"mynamespace/test2:http\" to [10.130.140.27:80]\n","stream":"stderr","time":"2020-10-20T08:57:22.787929726Z"}
{"log":"-A KUBE-SERVICES -m comment --comment \"mynamespace/test2:http has no endpoints\" -m tcp -p tcp -d 172.30.120.81/32 --dport 80 -j REJECT\n","stream":"stderr","time":"2020-10-20T08:57:22.811561709Z"}
{"log":"I1020 08:57:25.417747   84796 serviceresolver.go:161] Answered test2.mynamespace.svc.cluster.local.:false with msg.Service{Host:\"172.30.120.81\", Port:0, Priority:10, Weight:10, Text:\"\", Mail:false, Ttl:0x1e, TargetStrip:0, Group:\"\", Key:\"/skydns/local/cluster/svc/mynamespace/test2/bd537d1a\"}\n","stream":"stderr","time":"2020-10-20T08:57:25.417810313Z"}
{"log":"I1020 08:57:25.418124   84796 serviceresolver.go:161] Answered test2.mynamespace.svc.cluster.local.:false with msg.Service{Host:\"172.30.120.81\", Port:0, Priority:10, Weight:10, Text:\"\", Mail:false, Ttl:0x1e, TargetStrip:0, Group:\"\", Key:\"/skydns/local/cluster/svc/mynamespace/test2/bd537d1a\"}\n","stream":"stderr","time":"2020-10-20T08:57:25.418163831Z"}

Endpoints are not reflected correctly in the iptables rules during service creation (the Endpoints API objects do exist).
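
A quick way to confirm the API side, i.e. that all three Endpoints objects exist and are populated (assuming the mynamespace project as in the logs above):

$ oc get endpoints -n mynamespace test test2 test3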

Is there any means of retrieving the kube-proxy logs? (Found it and updated the first post.)
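
For anyone else looking: on OKD 3.11 the hybrid proxy runs inside the SDN pods, so the equivalent of the kube-proxy logs can be pulled from there (a sketch; sdn-xxxxx is a placeholder for the pod running on the affected node, and the on-disk path assumes the default docker json-file log driver, which matches the {"log": ...} format above):

$ oc -n openshift-sdn get pods -o wide
$ oc -n openshift-sdn logs sdn-xxxxx | grep -i proxy

# or directly on the node:
# grep proxy /var/log/containers/sdn-*_openshift-sdn_*.log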

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.