openshift/ansible-service-broker

APBs that worked w/ OpenShift 3.9 (and ASB-1.1) not working with 3.10 (and ASB-1.2)?

matzew opened this issue · 17 comments

Bug:

Using the 3.10 CLI and running oc cluster up --enable=service-catalog,web-console, I get an OpenShift cluster.

Then I install the Automation Broker with:

kubectl apply -f https://raw.githubusercontent.com/project-streamzi/ocp-broker/ASB_12_oc310/install.yaml

This basically contains commit 313572af9d865f4ca5167c5342cffb37ec798179 from @djzager, and I also provide the broker_dockerhub_org argument.

This brings up the catalog w/ my APBs -> 🎉
(Therefore I am closing #1041)
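
A quick way to confirm the broker actually registered and the APBs were listed (just a sketch; the commands assume a default install with the service-catalog API enabled):

kubectl get pods --all-namespaces | grep -i broker   # broker pod should be Running
kubectl get clusterservicebrokers                    # the automation broker should be registered here
kubectl get clusterserviceclasses                    # APBs show up here once the catalog relists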

However, actually running an APB now does not work.

What happened:

Here is an example of the failure that occurred:

TASK [provision-strimzi-apb : Login As Super User] *****************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "developer"], "delta": "0:00:00.406516", "end": "2018-08-17 08:40:01.361376", "msg": "non-zero return code", "rc": 1, "start": "2018-08-17 08:40:00.954860", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
PLAY RECAP *********************************************************************
localhost                  : ok=0    changed=0    unreachable=0    failed=1   
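
For what it's worth, the failing step is the playbook's oc login task. Per the cmd in the log it is run without an explicit server, so the client dials 127.0.0.1:8443 from inside the APB sandbox pod. Roughly how I'd try to reproduce it by hand (the pod name is hypothetical, and the sandbox pod may already be gone):

oc get pods --all-namespaces | grep strimzi-apb-prov
oc rsh dh-strimzi-apb-prov-xxxxx        # hypothetical sandbox pod name
oc login -u developer -p developer      # should hit the same "dial tcp 127.0.0.1:8443 ... connection refused"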

Also, I noticed (due to the failure) that a ton of "dh-strimzi-apb-prov-XXXXX" projects were created, all with the same error. I've never seen a failed provision create a ton of "retry" projects like that before.

Also, in the UI I noticed errors like:

Failed to list clusterserviceplans/servicecatalog.k8s.io/v1beta1 (status -1)
Failed to list projects/project.openshift.io/v1 (status -1)

What you expected to happen:

The APB runs smoothly with the 1.2 release.

How to reproduce it:

  • Install OpenShift:
oc cluster up --enable=service-catalog,web-console
  • Install the ASB:
kubectl apply -f https://raw.githubusercontent.com/project-streamzi/ocp-broker/ASB_12_oc310/install.yaml

Also, I noticed (due to the failure) that a ton of "dh-strimzi-apb-prov-XXXXX" projects were created, all with the same error. I've never seen a failed provision create a ton of "retry" projects like that before.

See #1010, this is related to the orphan mitigation in service-catalog.

Looking at your APB's logs, the connection refused error makes me think that something is wrong or missing with the inter-pod networking in your cluster.
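
A couple of spot checks for that theory (illustrative; the pod and address placeholders need to be filled in for your cluster):

$ oc get pods -n default                 # router and registry pods should exist and be Running
$ oc get nodes
$ oc exec <some-running-pod> -- curl -ks https://<master-ip>:8443/healthz   # hypothetical: verify a pod can reach the master (assumes the image has curl)

For comparison, I brought up a fresh cluster with: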

$ oc cluster up --enable=service-catalog,template-service-broker,router,registry,web-console,persistent-volumes,sample-templates,rhel-imagestreams
$ kubectl apply -f https://raw.githubusercontent.com/project-streamzi/ocp-broker/ASB_12_oc310/install.yaml

Then I ran your strimzi-apb without issue:


PLAY [strimzi-apb playbook to provision the application] ***********************
TASK [ansible.kubernetes-modules : Install latest openshift client] ************
skipping: [localhost]
TASK [ansibleplaybookbundle.asb-modules : debug] *******************************
skipping: [localhost]
TASK [provision-strimzi-apb : Login As Super User] *****************************
changed: [localhost]
TASK [provision-strimzi-apb : Create Cluster Operator Service Account yaml] ****
changed: [localhost]
TASK [provision-strimzi-apb : Create Cluster operator Service Account] *********
changed: [localhost]
TASK [provision-strimzi-apb : Delete Cluster Operator Template File] ***********
changed: [localhost]
TASK [provision-strimzi-apb : Create Role] *************************************
changed: [localhost]
TASK [provision-strimzi-apb : Create Role Based Access Control] ****************
changed: [localhost]
TASK [provision-strimzi-apb : Create k8s deployment] ***************************
changed: [localhost]
TASK [provision-strimzi-apb : Create Persistant Storage template] **************
changed: [localhost]
TASK [provision-strimzi-apb : Deploy a ZK and Kafka cluster] *******************
changed: [localhost]
TASK [provision-strimzi-apb : Wait for Strimzi topic Operator to become ready] ***
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (40 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (39 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (38 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (37 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (36 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (35 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (34 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (33 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (32 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (31 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (30 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (29 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (28 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (27 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (26 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (25 retries left).
FAILED - RETRYING: Wait for Strimzi topic Operator to become ready (24 retries left).
changed: [localhost]
PLAY RECAP *********************************************************************
localhost                  : ok=10   changed=10   unreachable=0    failed=0

hrm...

getting


TASK [provision-strimzi-apb : Login As Super User] *****************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "developer"], "delta": "0:00:00.592034", "end": "2018-08-20 08:16:33.348767", "msg": "non-zero return code", "rc": 1, "start": "2018-08-20 08:16:32.756733", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
PLAY RECAP *********************************************************************
localhost


Still an issue for me.

I re-ran all the things, and I've also got it working now.

@djzager thanks for your help, dude!

Hi, I'm also seeing a similar error deploying Strimzi:

TASK [provision-strimzi-apb : Login As Super User] *****************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "de"], "delta": "0:00:00.349035", "end": "2018-08-20 13:34:11.415052", "msg": "non-zero return code", "rc": 1, "start": "2018-08-20 13:34:11.066017", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}

I'm on Mac using the following Docker version:

Client:
 Version:      17.09.1-ce
 API version:  1.32
 Go version:   go1.8.3
 Git commit:   19e2cf6
 Built:        Thu Dec  7 22:22:25 2017
 OS/Arch:      darwin/amd64

Server:
 Version:      17.09.1-ce
 API version:  1.32 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   19e2cf6
 Built:        Thu Dec  7 22:28:28 2017
 OS/Arch:      linux/amd64
 Experimental: true

and oc:

oc v3.10.0+dd10d17
kubernetes v1.10.0+b81c8f8
features: Basic-Auth

Server https://127.0.0.1:8443
openshift v3.10.0+20c7bd1-8
kubernetes v1.10.0+b81c8f8

Is it possible, @sjwoodman, that your cluster was previously started without enabling the router? I noticed that your project's install script does have an oc cluster up, but it would simply skip that step if a cluster had already been started. Based on what I see in this issue, the router not being enabled is the only thing that would explain the connection refused.

You should also consider running docker system prune -a -f to make sure you don't have any stale origin images affecting you.
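
Roughly the clean-slate sequence I have in mind (the state directory path is illustrative and depends on where you ran cluster up):

$ oc cluster down
$ docker system prune -a -f
$ rm -rf ./openshift.local.clusterup     # or wherever your cluster-up state directory lives
$ oc cluster up --enable=service-catalog,template-service-broker,router,registry,web-console,persistent-volumes,sample-templates,rhel-imagestreams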

Hi David, I've tried a docker system prune -a -f but see the same behaviour. In terms of the state of OpenShift, it's a clean install, as I removed the openshift.cluster.local directory between each attempt. Are there any logs that you would suggest looking at to diagnose further?

@djzager So, this all runs fine on my Fedora box, but not on Mac.

We start oc cluster up with --routing-suffix=${ROUTING_SUFFIX} --public-hostname=${PUBLIC_IP}. This is all fine on Linux, but on Mac we get:

`fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "d"], "delta": "0:00:00.296468", "end": "2018-09-17 09:57:04.381026", "msg": "non-zero return code", "rc": 1, "start": "2018-09-17 09:57:04.084558", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}`

I realized that, unlike the broker-apb, the template does seem to have a ROUTING_SUFFIX parameter: https://github.com/openshift/ansible-service-broker/blob/master/templates/deploy-ansible-service-broker.template.yaml#L390-L391

Is there an equivalent for that in the APB?
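
For reference, if we installed via the template instead, passing that parameter would presumably look something like this (a sketch; ROUTING_SUFFIX comes from the linked template, the rest is illustrative and the template may require more parameters):

oc process -f https://raw.githubusercontent.com/openshift/ansible-service-broker/master/templates/deploy-ansible-service-broker.template.yaml \
  -p ROUTING_SUFFIX="${ROUTING_SUFFIX}" \
  | oc create -f -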

Here is our customized config, pointing to the 1.2 image:
https://github.com/project-streamzi/ocp-broker/blob/clean_up/install.yaml#L40

@djzager I'm wondering, is there anyone on your team who uses a Mac for development, so they could try to execute our script?

Any comment, @djzager?

I'm going to add @jmontleon to this as I'm not really in a position to be helpful (paternity leave) at the moment.

@matzew we were actually doing some investigation on a separate issue, openshift/origin#20991, about the same error connecting to the public hostname.

From what we could see, when you set --public-hostname it doesn't work correctly on Mac. @jwmatthews was experimenting on his Mac and was able to reproduce the issue. I think he mentioned that oc uses socat to make the connection work when you don't use --public-hostname, and that it may not be setting up the relay, or not setting it up properly.

To work around it, don't use the --public-hostname (and possibly also the --routing-suffix) option on Mac.
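
Concretely, on Mac that would just mean starting the cluster without those two options, e.g. (using the enable flags from the repro above):

oc cluster up --enable=service-catalog,web-console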

Thanks @djzager - enjoy your time off!

@jmontleon thanks, and you are right about what you say and what is in #20991, but they are separate issues. On a Mac with OpenShift 3.10, if you set --public-hostname and --routing-suffix, OpenShift will not start up; it fails with a timeout.

However, if you do not set those parameters, OpenShift will boot but APBs will not work (replicated on two different Macs). The failure is as @matzew listed (from the APB, not OpenShift itself):

fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["oc", "login", "-u", "developer", "-p", "d"], "delta": "0:00:00.296468", "end": "2018-09-17 09:57:04.381026", "msg": "non-zero return code", "rc": 1, "start": "2018-09-17 09:57:04.084558", "stderr": "error: dial tcp 127.0.0.1:8443: getsockopt: connection refused", "stderr_lines": ["error: dial tcp 127.0.0.1:8443: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

/close

@jmrodri: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.