cloudfoundry-incubator/quarks-operator

ig job stays in Completed state for a long time

f0rmiga opened this issue · 1 comment

I experienced an ig job that got into the Completed state and stayed there for a few minutes. Eventually, the controllers picked up its output and moved the cluster state forward. While it was in this Completed state, the KubeCF cluster was broken, with a few pods deleted. E.g. the following is the pod list in an HA deployment. Notice the missing api, uaa, diego-cell, and router replicas.

NAME                                     READY   STATUS      RESTARTS   AGE
api-0                                    15/15   Running     5          24m
auctioneer-0                             4/4     Running     1          28m
bosh-dns-755d6b884b-cwqgw                1/1     Running     0          13m
bosh-dns-755d6b884b-h92mh                1/1     Running     0          13m
cc-worker-0                              4/4     Running     2          27m
cf-apps-dns-564fc5cf4d-jzbcv             1/1     Running     0          14m
cf-apps-dns-564fc5cf4d-qnw46             1/1     Running     0          14m
credhub-0                                6/6     Running     0          27m
credhub-1                                6/6     Running     0          29m
database-0                               2/2     Running     0          13m
database-seeder-8f24862205dd7db3-46p5n   0/2     Completed   0          118m
diego-api-0                              6/6     Running     2          28m
diego-cell-0                             7/7     Running     2          22m
diego-cell-1                             7/7     Running     1          25m
doppler-0                                4/4     Running     0          27m
doppler-1                                4/4     Running     0          27m
doppler-2                                4/4     Running     0          28m
ig-a01395ca9859fa55-rv65v                0/22    Completed   0          13m
log-api-0                                7/7     Running     0          27m
log-cache-0                              8/8     Running     0          28m
nats-0                                   4/4     Running     0          28m
nats-1                                   4/4     Running     0          28m
router-0                                 5/5     Running     0          27m
routing-api-0                            4/4     Running     2          27m
scheduler-0                              10/10   Running     6          27m
tcp-router-0                             5/5     Running     0          28m
uaa-0                                    7/7     Running     0          25m

The following are log dumps of the cf-operator and quarks-job controllers:

cf_operator_logs.txt
quarks_job_logs.txt
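
For anyone trying to reproduce or inspect this, the cluster state and logs can be captured with something like the commands below. This is only a sketch: the kubecf and cf-operator namespaces and the controller deployment names are assumptions and may differ per installation.

# Pod list in the deployment namespace (assumed: kubecf)
kubectl get pods -n kubecf
# Events and container status for the stuck ig pod
kubectl describe pod ig-a01395ca9859fa55-rv65v -n kubecf
# Controller logs; deployment names are assumptions, check `kubectl get deployments -n cf-operator`
kubectl logs -n cf-operator deployment/cf-operator > cf_operator_logs.txt
kubectl logs -n cf-operator deployment/cf-operator-quarks-job > quarks_job_logs.txt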

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/175255805

The labels on this GitHub issue will be updated when the story is started.