openshift/ansible-service-broker

APBs that worked w/ OpenShift 3.9 (and ASB-1.1) not working with 3.10 (and ASB-1.2)?

matzew opened this issue · 17 comments

Bug:

Using the 3.10 CLI and running oc cluster up --enable=service-catalog,web-console, I get an OpenShift cluster.

Then I install the Automation Broker with:

kubectl apply -f https://raw.githubusercontent.com/project-streamzi/ocp-broker/ASB_12_oc310/install.yaml

This basically contains commit 313572af9d865f4ca5167c5342cffb37ec798179 from @djzager, and I also provide the broker_dockerhub_org argument.

This brings up the catalog w/ my APBs -> 🎉
(Therefore I am closing #1041)
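
A quick way to confirm the broker actually registered and the APBs were listed (just a sketch; the commands assume a default install with the service-catalog API enabled):

kubectl get pods --all-namespaces | grep -i broker   # broker pod should be Running
kubectl get clusterservicebrokers                    # the automation broker should be registered here
kubectl get clusterserviceclasses                    # APBs show up here once the catalog relists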

However, actually running an APB now does not work.

What happened:

Here is an example of the failure that occurred:

TASK [provision-strimzi-apb : Login As Super User] *****************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "developer"], "delta": "0:00:00.406516", "end": "2018-08-17 08:40:01.361376", "msg": "non-zero return code", "rc": 1, "start": "2018-08-17 08:40:00.954860", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
PLAY RECAP *********************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1   
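
For what it's worth, the failing step is the playbook's oc login task. Per the cmd in the log it is run without an explicit server, so the client dials 127.0.0.1:8443 from inside the APB sandbox pod. Roughly how I'd try to reproduce it by hand (the pod name is hypothetical, and the sandbox pod may already be gone):

oc get pods --all-namespaces | grep strimzi-apb-prov
oc rsh dh-strimzi-apb-prov-xxxxx        # hypothetical sandbox pod name
oc login -u developer -p developer      # should hit the same "dial tcp 127.0.0.1:8443 ... connection refused"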

Also, I noticed (due to the failure) that a ton of "dh-strimzi-apb-prov-XXXXX" projects were created, all with the same error. I've never seen a failed provision create a ton of "retry" projects like that before.

Also, in the UI I noticed errors like:

Failed to list clusterserviceplans/servicecatalog.k8s.io/v1beta1 (status -1)
Failed to list projects/project.openshift.io/v1 (status -1)

What you expected to happen:

The APB runs smoothly with the 1.2 release.

How to reproduce it:

  • Install OpenShift:
oc cluster up --enable=service-catalog,web-console
  • Install the ASB:
kubectl apply -f https://raw.githubusercontent.com/project-streamzi/ocp-broker/ASB_12_oc310/install.yaml

Also, I noticed (due to the failure) that a ton of "dh-strimzi-apb-prov-XXXXX" projects were created, all with the same error. I've never seen a failed provision create a ton of "retry" projects like that before.

See #1010, this is related to the orphan mitigation in service-catalog.

Looking at your APB's logs, the connection refused error makes me think that something is wrong or missing with the inter-pod networking in your cluster.
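
A couple of spot checks for that theory (illustrative; the pod and address placeholders need to be filled in for your cluster):

$ oc get pods -n default                 # router and registry pods should exist and be Running
$ oc get nodes
$ oc exec <some-running-pod> -- curl -ks https://<master-ip>:8443/healthz   # hypothetical: verify a pod can reach the master (assumes the image has curl)

For comparison, I brought up a fresh cluster with: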

$ oc cluster up --enable=service-catalog,template-service-broker,router,registry,web-console,persistent-volumes,sample-templates,rhel-imagestreams
$ kubectl apply -f https://raw.githubusercontent.com/project-streamzi/ocp-broker/ASB_12_oc310/install.yaml

Then I ran your strimzi-apb without issue:


PLAY [strimzi-apb playbook to provision the application] ***********************
TASK [ansible.kubernetes-modules : Install latest openshift client] ************
skipping: [localhost]
TASK [ansibleplaybookbundle.asb-modules : debug] *******************************
skipping: [localhost]
TASK [provision-strimzi-apb : Login As Super User] *****************************
changed: [localhost]
TASK [provision-strimzi-apb : Create Cluster Operator Service Account yaml] ****
changed: [localhost]
TASK [provision-strimzi-apb : Create Cluster operator Service Account] *********
changed: [localhost]
TASK [provision-strimzi-apb : Delete Cluster Operator Template File] ***********
changed: [localhost]
TASK [provision-strimzi-apb : Create Role] *************************************
changed: [localhost]
TASK [provision-strimzi-apb : Create Role Based Access Control] ****************
changed: [localhost]
TASK [provision-strimzi-apb : Create k8s deployment] ***************************
changed: [localhost]
TASK [provision-strimzi-apb : Create Persistant Storage template] **************
changed: [localhost]
TASK [provision-strimzi-apb : Deploy a ZK and Kafka cluster] *******************
changed: [localhost]
TASK [provision-strimzi-apb : Wait for Strimzi topic Operator to become ready] ***
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (40 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (39 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (38 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (37 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (36 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (35 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (34 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (33 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (32 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (31 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (30 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (29 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (28 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (27 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (26 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (25 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (24 retries left).
changed: [localhost]
PLAY RECAP *********************************************************************
localhost                  : ok=10   changed=10   unreachable=0    failed=0

hrm...

getting


TASK [provision-strimzi-apb : Login As Super User] *****************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "developer"], "delta": "0:00:00.592034", "end": "2018-08-20 08:16:33.348767", "msg": "non-zero return code", "rc": 1, "start": "2018-08-20 08:16:32.756733", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
PLAY RECAP *********************************************************************
localhost


Still an issue for me.

I re-ran all the things, and I've also got it working now.

@djzager thanks for your help, dude!

Hi, I'm also seeing a similar error deploying Strimzi:

TASK [provision-strimzi-apb : Login As Super User] *****************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "de"], "delta": "0:00:00.349035", "end": "2018-08-20 13:34:11.415052", "msg": "non-zero return code", "rc": 1, "start": "2018-08-20 13:34:11.066017", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}

I'm on Mac using the following Docker version:

Client:
 Version:      17.09.1-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   19e2cf6
 Built:        Thu Dec  7 22:22:25 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.09.1-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   19e2cf6
 Built:        Thu Dec  7 22:28:28 2017
 OS/Arch:      linux/amd64
 Experimental: true

and oc:

oc v3.10.0+dd10d17
kubernetes v1.10.0+b81c8f8
features: Basic-Auth

Server https://127.0.0.1:8443
openshift v3.10.0+20c7bd1-8
kubernetes v1.10.0+b81c8f8

Is it possible, @sjwoodman, that your cluster was previously started without enabling the router? I noticed that your project's install script does have an oc cluster up, but it would simply skip that step if a cluster had already been started. Based on what I see in this issue, the router not being enabled is the only thing that would explain the connection refused.

You should also consider running docker system prune -a -f to make sure you don't have any stale origin images affecting you.
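
Roughly the clean-slate sequence I have in mind (the state directory path is illustrative and depends on where you ran cluster up):

$ oc cluster down
$ docker system prune -a -f
$ rm -rf ./openshift.local.clusterup     # or wherever your cluster-up state directory lives
$ oc cluster up --enable=service-catalog,template-service-broker,router,registry,web-console,persistent-volumes,sample-templates,rhel-imagestreams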

Hi David, I've tried a docker system prune -a -f but see the same behaviour. In terms of the state of OpenShift, it's a clean install, as I removed the openshift.cluster.local directory between each attempt. Are there any logs that you would suggest looking at to diagnose further?

@djzager So, this all runs fine on my Fedora box, but not on Mac.

We start oc cluster up with --routing-suffix=${ROUTING_SUFFIX} --public-hostname=${PUBLIC_IP}. This is all fine on Linux, but on Mac we get:

`fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "d"], "delta": "0:00:00.296468", "end": "2018-09-17 09:57:04.381026", "msg": "non-zero return code", "rc": 1, "start": "2018-09-17 09:57:04.084558", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}`

I realized that, unlike the broker-apb, the template does seem to have a ROUTING_SUFFIX parameter: https://github.com/openshift/ansible-service-broker/blob/master/templates/deploy-ansible-service-broker.template.yaml#L390-L391

Is there an equivalent for that in the APB?
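
For reference, if we installed via the template instead, passing that parameter would presumably look something like this (a sketch; ROUTING_SUFFIX comes from the linked template, the rest is illustrative and the template may require more parameters):

oc process -f https://raw.githubusercontent.com/openshift/ansible-service-broker/master/templates/deploy-ansible-service-broker.template.yaml \
  -p ROUTING_SUFFIX="${ROUTING_SUFFIX}" \
  | oc create -f -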

Here is our customized config, pointing to the 1.2 image:
https://github.com/project-streamzi/ocp-broker/blob/clean_up/install.yaml#L40

@djzager I'm wondering, is there anyone on your team who uses a Mac for development, so they could try to execute our script?

Any comment, @djzager?

I'm going to add @jmontleon to this as I'm not really in a position to be helpful (paternity leave) at the moment.

@matzew we were actually doing some investigation on a separate issue, openshift/origin#20991, about the same error connecting to the public hostname.

From what we could see, when you set --public-hostname it doesn't work correctly on Mac. @jwmatthews was experimenting on his Mac and was able to reproduce the issue. I think he mentioned that oc uses socat to make the connection work when you don't use --public-hostname, and that it may not be setting up the relay, or not setting it up properly.

To work around it, don't use the --public-hostname (and possibly also the --routing-suffix) option on Mac.
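
Concretely, on Mac that would just mean starting the cluster without those two options, e.g. (using the enable flags from the repro above):

oc cluster up --enable=service-catalog,web-console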

Thanks @djzager - enjoy your time off!

@jmontleon thanks, and you are right about what you say and what is in #20991, but they are separate issues. On a Mac with OpenShift 3.10, if you set --public-hostname and --routing-suffix, OpenShift will not start up; it fails with a timeout.

However, if you do not set those parameters, OpenShift will boot but APBs will not work (replicated on two different Macs). The failure is as @matzew listed (from the APB, not OpenShift itself):

fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "d"], "delta": "0:00:00.296468", "end": "2018-09-17 09:57:04.381026", "msg": "non-zero return code", "rc": 1, "start": "2018-09-17 09:57:04.084558", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/close

@jmrodri: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.