Envoy init hang: EDS must respond to all requests
Closed this issue · 10 comments
Envoy goes through the following init sequence:
- Load clusters with CDS
- Load endpoints with EDS (bidirectional gRPC stream)
- Load listeners with LDS
- Load routes with RDS
Until this sequence finishes, Envoy is not operational.
If Pilot fails to respond to an EDS request, Envoy waits indefinitely and the init process never finishes.
Such an Envoy has no listeners, so it rejects all inbound and outbound traffic.
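To make the gating concrete, here is a minimal Go model of the sequence above. It is only an illustration: `StuckAt` and `Operational` are hypothetical names, and the map of received responses stands in for Envoy's real init state, which this sketch does not attempt to reproduce.

```go
package main

// initOrder is the init sequence described in this issue.
var initOrder = []string{"CDS", "EDS", "LDS", "RDS"}

// StuckAt returns the first stage that has not received a response,
// or "" once every stage has completed. Stages are strictly ordered,
// so a missing EDS response blocks LDS and RDS as well.
func StuckAt(responded map[string]bool) string {
	for _, stage := range initOrder {
		if !responded[stage] {
			return stage
		}
	}
	return ""
}

// Operational reports whether the whole sequence has finished.
func Operational(responded map[string]bool) bool {
	return StuckAt(responded) == ""
}
```

With only a CDS response recorded, `StuckAt` reports `EDS` and `Operational` is false, which matches the hang described here: LDS is never reached, so no listeners are loaded.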
- Pilot must ensure that every EDS request gets a response.
- Envoy must have a timeout during the init process. If init does not finish in a reasonable amount of time, Envoy should do one of the following:
a. Error out and die.
b. Continue with what it has.
c. Make the choice between (a) and (b) configurable.
Note: Check and fix the same issue in ADS.
Under high transience, this is more likely to occur.
This has been reported by several customers. The symptom is that some or many proxy-injected pods start crash looping.
If EDS blocks or deadlocks:
- Existing Envoys stop receiving new configuration and keep operating on stale configuration.
- New Envoys never finish initialization, so they never start serving traffic.
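On the Pilot side, the requirement is that no request is silently dropped. The Go sketch below shows the idea with simplified stand-in types; `Request`, `Response`, `Serve`, and the `lookup` callback are hypothetical and are not Pilot's actual code. The real xDS messages also carry version and nonce fields and include ACK traffic, all of which is ignored here.

```go
package main

// Request and Response are simplified stand-ins for EDS discovery
// messages. They are not the real xDS types.
type Request struct {
	Cluster string
}

type Response struct {
	Cluster   string
	Endpoints []string
}

// Serve sends exactly one response for every request it reads.
// An unknown cluster still gets a response, with an empty endpoint
// list, so the client is never left waiting.
func Serve(reqs <-chan Request, resps chan<- Response, lookup func(cluster string) ([]string, bool)) {
	defer close(resps)
	for r := range reqs {
		eps, ok := lookup(r.Cluster)
		if !ok {
			eps = nil // nothing known yet: respond anyway
		}
		resps <- Response{Cluster: r.Cluster, Endpoints: eps}
	}
}
```

The design point is that "no endpoints" is itself an answer. A client that receives an empty list can finish initialization and pick up endpoints from a later push, whereas a client that receives nothing blocks as described above.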
Sounds exactly like what I saw!
We are backporting the fix to 0.7.
We will close the issue after that is done.
Since the PR is merged into 0.8, this is no longer blocking the 0.8 release.
@mandarjog what is pending to close this issue?
Potentially backporting to 0.7.x. If we are not going to backport, then let's close this.
0.8 is fine for now. We can consider 0.7.x if someone explicitly asks for it.