istio/old_issues_repo

Service API returns "upstream connect error" until istio-pilot is restarted.

Closed this issue · 5 comments

Bug: Y
(It seems that sometimes the new pod IP is not propagated after a new service deployment until istio-pilot is restarted.)

Environment:
istioctl version 0.7.1 (Istio Auth not enabled)
kubectl version 1.9.5
Network: Calico

Occurrence rate:
Roughly once per day of cluster operation (about 1 occurrence per 1000 new service deployments).

Steps
1. Continuously deploy and run new services.

2. Access one service URL: https://ing.xxxx.xxx.xxxx.com/store/api/health (a curl reproduction sketch follows below).
Expected result: 200 with body {"status":"UP"}
Actual result: 503 with message "upstream connect error or disconnect/reset before headers"
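For reference, the check in step 2 is a plain HTTP call against the ingress host; a minimal reproduction sketch, keeping the redacted hostname as-is:

curl -i https://ing.xxxx.xxx.xxxx.com/store/api/health
# expected: HTTP/1.1 200 with body {"status":"UP"}
# observed: HTTP/1.1 503, "upstream connect error or disconnect/reset before headers"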

3. From the istio-ingress access log, the upstream pod IP appears to be 10.233.125.40:3000 (a log-fetching sketch follows the excerpt):
... ...
[2018-05-30T06:24:52.741Z] "GET /store/api/health HTTP/1.1" 503 UF 0 57 1001 - "10.233.90.192" "curl/7.47.0" "e05ccecb-63b6-9eba-af20-46423538f61e"
"ing.xxxx.xxx.xxxx.com" "10.233.125.40:3000"
... ...
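The excerpt above is the istio-ingress access log; a sketch of how it can be pulled, assuming the default istio=ingress pod label of this release (verify the label in your install):

kubectl logs -n istio-system -l istio=ingress --tail=500 | grep '/store/api/health'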

4. From the app service, the endpoint IP is 10.233.82.29:3000, which differs from step 3 (an endpoint cross-check sketch follows the output):
kubectl describe svc hp-store-service -n hp
Name: hp-store-service
Namespace: hp
Labels: app=hp-store-service
Annotations: prometheus.io/path=/prometheus
prometheus.io/port=9090
prometheus.io/probe=true
prometheus.io/scrape=true
Selector: app=hp-store
Type: ClusterIP
IP: 10.233.47.221
Port: http 80/TCP
TargetPort: 3000/TCP
Endpoints: 10.233.82.29:3000
Session Affinity: None
Events:
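To cross-check which backend Kubernetes itself considers live (versus the 10.233.125.40 address the ingress still routes to in step 3), the Endpoints object can be queried directly; a minimal sketch:

kubectl get endpoints hp-store-service -n hp
# shows 10.233.82.29:3000, matching the Service above but not the IP in the ingress log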

5. The issue keeps occurring for the next hour; it does not recover on its own.

6. Restart istio-pilot (a label-based variant is sketched after the command):
test1b@ip-172-31-17-153:~$ kubectl delete pod istio-pilot-67d6ddbdf6-c6xb6 -n istio-system
pod "istio-pilot-67d6ddbdf6-c6xb6" deleted

7. After waiting one minute, accessing https://ing.xxxx.xxx.xxxx.com/store/api/health now returns 200 {"status":"UP"}.

8. From the new istio-ingress access log, the upstream pod IP is now correct:
... ...
[2018-05-30T06:25:35.991Z] "GET /store/api/health HTTP/1.1" 200 - 0 16 37 36 "10.233.125.0" "curl/7.47.0" "84823019-f6b0-9021-b501-963376ec3516"
"ing.xxxx.xxx.xxxx.com" "10.233.82.29:3000"
... ...

@mandarjog this sounds similar to istio/istio#5391

Any logs from Pilot? I don't think it's the same problem: here it looks like some (bad) endpoint is sent to Envoy, whereas in the other bug Envoy wouldn't get any endpoint assignment at all. While investigating that, we found a few other cases that could be affected by the same problem, so I would say different behavior but the same root cause, and likely fixed in 0.8.
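One way to tell the two cases apart (a bad endpoint pushed to Envoy vs. no endpoint assignment at all) is to inspect the Envoy admin interface on the ingress pod; a sketch, assuming the default Envoy admin port 15000 and using a placeholder pod name:

kubectl port-forward -n istio-system istio-ingress-xxxxx 15000:15000 &   # pod name is a placeholder
curl -s localhost:15000/clusters | grep ':3000'
# a stale 10.233.125.40:3000 entry would indicate a bad endpoint was pushed rather than none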

There are a few additional debug endpoints in 0.8, including "/debug/endpointz?brief=true", which lists Pilot's view of the endpoints and can be used to find out whether the problem is on the ingestion side or the push side.
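A sketch of querying that debug endpoint through a port-forward to the Pilot pod, assuming Pilot's HTTP debug interface is served on port 8080 (pod name is a placeholder; adjust both to your deployment):

kubectl port-forward -n istio-system istio-pilot-xxxxx 8080:8080 &   # placeholder pod name
curl -s 'localhost:8080/debug/endpointz?brief=true' | grep hp-store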

Certainly a stale endpoint, which we had seen with missed updates. The symptoms are similar. As Costin says, very likely fixed in 0.8.

As @costinm suggests, can you upgrade to 0.8 and see if that resolves your issue?

Closing this since it refers to an old Istio version.
Actually, both 0.8 and 1.0 still show the 503 issue; however, I will track it elsewhere.