openshift/ansible-service-broker

Creating default "automationbroker" CR results in frozen ASB operator stuck on "injecting owner reference"

djwhatle opened this issue · 12 comments

Bug: Creating default "automationbroker" CR results in frozen ASB operator

What happened

  • Set up OpenShift v3.11 cluster with OLM and latest ASB CSV
  • Created automation-broker CR with default values in automation-broker namespace
  • automation-broker-operator container freezes while running ASB creation playbook

CR I created with default params (doesn't work):

apiVersion: automationbroker.io/v1alpha1
kind: AutomationBroker
metadata:
  name: automation-broker
  namespace: automation-broker
spec:
  createBrokerNamespace: 'false'
  waitForBroker: 'false'

ASB Operator Logs:

time="2018-11-26T16:31:32Z" level=info msg="Go Version: go1.10.4"
time="2018-11-26T16:31:32Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-11-26T16:31:32Z" level=info msg="operator-sdk Version: 0.0.6+git"
time="2018-11-26T16:31:32Z" level=info msg="Starting to serve on 127.0.0.1:8888\n"
time="2018-11-26T16:31:32Z" level=info msg="Watching automationbroker.io/v1alpha1, AutomationBroker, default"
{"level":"info","ts":1543249892.549802,"logger":"kubebuilder.controller","caller":"controller/controller.go:120","msg":"Starting EventSource","Controller":"automationbroker-controller","Source":{"Type":{"apiVersion":"automationbroker.io/v1alpha1","kind":"AutomationBroker"}}}
{"level":"info","ts":1543249892.5502899,"logger":"kubebuilder.controller","caller":"controller/controller.go:120","msg":"Starting EventSource","Controller":"automationbroker-controller","SourceError":"json: unsupported type: <-chan event.GenericEvent"}
{"level":"info","ts":1543249892.6506014,"logger":"kubebuilder.controller","caller":"controller/controller.go:134","msg":"Starting Controller","Controller":"automationbroker-controller"}
{"level":"info","ts":1543249892.7509446,"logger":"kubebuilder.controller","caller":"controller/controller.go:153","msg":"Starting workers","Controller":"automationbroker-controller","WorkerCount":1}
time="2018-11-26T16:31:45Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=8258902912821821880
time="2018-11-26T16:31:46Z" level=info msg="[playbook task]: automation-broker : Set broker namespace state=present" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:46Z" level=info msg="[playbook task]: automation-broker : Verify service catalog is installed" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:48Z" level=error msg="[failed]: [playbook task] 'Verify service catalog is installed' failed with task_args - msg=Service Catalog must be installed, that=[u'\"servicecatalog.k8s.io\" in lookup(\"k8s\", cluster_info=\"api_groups\")']" component=logging_event_handler event_type=runner_on_failed gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:48Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=8258902912821821880
time="2018-11-26T16:31:48Z" level=error msg="error from ansible-runner: exit status 2" component=runner job=8258902912821821880 name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:48Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=5459287199784753161
time="2018-11-26T16:31:49Z" level=info msg="[playbook task]: automation-broker : Set broker namespace state=present" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:49Z" level=info msg="[playbook task]: automation-broker : Verify service catalog is installed" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:51Z" level=error msg="[failed]: [playbook task] 'Verify service catalog is installed' failed with task_args - msg=Service Catalog must be installed, that=[u'\"servicecatalog.k8s.io\" in lookup(\"k8s\", cluster_info=\"api_groups\")']" component=logging_event_handler event_type=runner_on_failed gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:51Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=5459287199784753161
time="2018-11-26T16:31:51Z" level=error msg="error from ansible-runner: exit status 2" component=runner job=5459287199784753161 name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:51Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=2015931357020424987
time="2018-11-26T16:31:52Z" level=info msg="[playbook task]: automation-broker : Set broker namespace state=present" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:52Z" level=info msg="[playbook task]: automation-broker : Verify service catalog is installed" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:53Z" level=error msg="[failed]: [playbook task] 'Verify service catalog is installed' failed with task_args - msg=Service Catalog must be installed, that=[u'\"servicecatalog.k8s.io\" in lookup(\"k8s\", cluster_info=\"api_groups\")']" component=logging_event_handler event_type=runner_on_failed gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:53Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=2015931357020424987
time="2018-11-26T16:31:53Z" level=error msg="error from ansible-runner: exit status 2" component=runner job=2015931357020424987 name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:53Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=446230090832441730
time="2018-11-26T16:31:55Z" level=info msg="[playbook task]: automation-broker : Set broker namespace state=present" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:55Z" level=info msg="[playbook task]: automation-broker : Verify service catalog is installed" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:55Z" level=error msg="[failed]: [playbook task] 'Verify service catalog is installed' failed with task_args - msg=Service Catalog must be installed, that=[u'\"servicecatalog.k8s.io\" in lookup(\"k8s\", cluster_info=\"api_groups\")']" component=logging_event_handler event_type=runner_on_failed gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:55Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=446230090832441730
time="2018-11-26T16:31:55Z" level=error msg="error from ansible-runner: exit status 2" component=runner job=446230090832441730 name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:56Z" level=info msg="dropping event that is not a JobEvent" component=eventapi job=583322328950460987
time="2018-11-26T16:31:57Z" level=info msg="[playbook task]: automation-broker : Set broker namespace state=present" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:57Z" level=info msg="[playbook task]: automation-broker : Verify service catalog is installed" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:57Z" level=info msg="[playbook task]: automation-broker : Verify broker namespace exists" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:57Z" level=info msg="[playbook task]: automation-broker : Verify the broker has no outstanding service instances if deprovisioning" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:57Z" level=info msg="[playbook task]: automation-broker : Create dashboard redirector route if OpenShift" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" name=automation-broker namespace=automation-broker
time="2018-11-26T16:31:59Z" level=info msg="injecting owner reference"

What you expected to happen

ASB Operator provisions broker successfully in response to automation-broker CR being created, even when default params are provided.

Workaround

Creating this CR instead will result in a working Automation Broker instance:

apiVersion: "automationbroker.io/v1alpha1"                                        
kind: "AutomationBroker"                                                          
metadata:                                                                          
  name: automation-broker                                                        
  namespace: automation-broker                                                    
spec:                                                                                                                   
  createBrokerNamespace: false                                                    
  waitForBroker: false                                                            
  brokerLocalOpenshiftEnabled: true

How to reproduce it

Set up a v3.11 env in catasb, subscribe the automation-broker namespace to the Automation Broker Operator, and create the default automationbroker CR to see the freeze.

I need to determine what about the modified CR leads to a successful provision. Unfortunately, pretty inconvenient thing to test since I haven't found a way to retry without resetting the whole cluster.

sounds similar to this one #1105

Is it possible that the broker operator needs to be rebuilt with the latest ansible operator image?

@djzager definitely possible. To my knowledge, hasn't been rebuilt in a while.

P.S. had a conversation with @mhrivnak on this and found out that it's possible to clean up in this scenario by manually editing the automationbroker CR and deleting the finalizers section. Should allow for quicker troubleshooting.

@djzager I rebuilt the ASB operator image using the latest AO base image from water-hole and tried provisioning with the default CR again... still have the same issue with lockup.

Note the updated operator-sdk version number in these new logs:

time="2018-11-26T20:39:58Z" level=info msg="Watching automation-broker namespace."
time="2018-11-26T20:39:58Z" level=info msg="Go Version: go1.10.5"
time="2018-11-26T20:39:58Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-11-26T20:39:58Z" level=info msg="operator-sdk Version: v0.1.1+git"
time="2018-11-26T20:39:58Z" level=info msg="Starting to serve on 127.0.0.1:8888\n"
time="2018-11-26T20:39:58Z" level=info msg="Watching automationbroker.io/v1alpha1, AutomationBroker"
{"level":"info","ts":1543264798.8120434,"logger":"kubebuilder.controller","caller":"controller/controller.go:120","msg":"Starting EventSource","Controller":"automationbroker-controller","Source":"kind source: automationbroker.io/v1alpha1, Kind=AutomationBroker"}
{"level":"info","ts":1543264798.9125986,"logger":"kubebuilder.controller","caller":"controller/controller.go:134","msg":"Starting Controller","Controller":"automationbroker-controller"}
{"level":"info","ts":1543264799.012966,"logger":"kubebuilder.controller","caller":"controller/controller.go:153","msg":"Starting workers","Controller":"automationbroker-controller","WorkerCount":1}
time="2018-11-26T20:40:19Z" level=info msg="[playbook task]: automation-broker : Set broker namespace state=present" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" job=2155918680542341066 name=automation-broker namespace=automation-broker
time="2018-11-26T20:40:19Z" level=info msg="[playbook task]: automation-broker : Verify service catalog is installed" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" job=2155918680542341066 name=automation-broker namespace=automation-broker
time="2018-11-26T20:40:20Z" level=info msg="[playbook task]: automation-broker : Verify broker namespace exists" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" job=2155918680542341066 name=automation-broker namespace=automation-broker
time="2018-11-26T20:40:20Z" level=info msg="[playbook task]: automation-broker : Verify the broker has no outstanding service instances if deprovisioning" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" job=2155918680542341066 name=automation-broker namespace=automation-broker
time="2018-11-26T20:40:20Z" level=info msg="[playbook task]: automation-broker : Create dashboard redirector route if OpenShift" component=logging_event_handler event_type=playbook_on_task_start gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" job=2155918680542341066 name=automation-broker namespace=automation-broker
time="2018-11-26T20:40:21Z" level=info msg="injecting owner reference"

@djzager updated my original issue comment. As indicated by the issue you linked that Erik originally wrote, adding brokerLocalOpenshiftEnabled: true to the CR spec fixes the issue:

apiVersion: "automationbroker.io/v1alpha1"                                        
kind: "AutomationBroker"                                                          
[...]                                        
spec:                                                                                                                   
[...]                                   
  brokerLocalOpenshiftEnabled: true

Had a discussion/found a potential solution for this issue with @djzager and @jmontleon .

After rebuilding ASB operator on the latest ansible-operator image from water-hole based on operator-sdk v0.1.1, using the default CR spec values did work, but lead to very slow completion of each run of the AO reconciliation loop.

Noticed in the ASB operator logs that "Broker ready status" was the final action that ran before seeing "exiting successfully":

time="2018-11-27T16:33:19Z" level=info msg="[playbook debug]: msg=Broker ready status: [u'True']" component=logging_event_handler event_type=runner_on_ok gvk="automationbroker.io/v1alpha1, Kind=AutomationBroker" job=295688355826356509 name=automation-broker namespace=automation-broker

Thought this delayed "Broker ready" message was strange because we specified to not wait for broker readiness so that the reconciliation loop would run quicker:

spec:                                                                                                                   
  [...]
  waitForBroker: "false"                                                          
  [...]

Seeing that the operator was still waiting for broker deploy to complete, we hypothesized that "false" was not parsed as a boolean with value false by the playbook, and that instead the string "false" was being interpreted as a truthy value. When using the OLM web UI, pre-filled spec values always gets stringified as such.

Turns out removing waitForBroker from the spec vars and letting the default value of false get used sped up the reconciliation loop, and now it doesn't feel like the reconciliation loop is freezing at all.

I've verified that removing the quotes around waitForBroker also solves the issue, but since OLM adds the quotes...

Proposed solution

  • Remove waitForBroker as a default var in sample CR
  • Add casts from string to boolean in the ASB deploy playbook
  • Update official ASB operator image to use latest ansible-operator image from water-hole

Whoops finger slipped, please ignore above activity. Proposal makes sense to me, good find on the source of the issue. I wonder if there are other values failing more silently in similar ways.

Good point, other quoted boolean values provided in the spec are going to be misinterpreted as well (unless we have boolean casts in place for some vars but not others).

I went ahead and updated docker.io/automationbroker/automation-broker-operator:latest with an operator-sdk v0.1.1 based image, so if you're using catasb to play with this stuff you should be able to successfully deploy ASB with the operator consistently without changing the CR spec.

If you do remove the waitForBroker value specification the reconciliation loop should speed up significantly.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/close

@jmrodri: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.