libvirt: Unable to access web console
rhopp opened this issue · 47 comments
Version
$ openshift-install version
v0.9.0-master
(compiled from master)
Platform (aws|libvirt|openstack):
libvirt
What happened?
I'm trying to install OpenShift 4 using this installer. It seems that everything was OK. I've done all the steps described here. The installation was fine and I was able to log in using oc
with the credentials from the installation output, but I'm not able to access the web console.
Looking at the openshift-console project, everything seems OK:
OUTPUT
╭─rhopp@dhcp-10-40-4-106 ~/go/src/github.com/openshift/installer ‹master*›
╰─$ oc project openshift-console
Already on project "openshift-console" on server "https://test1-api.tt.testing:6443".
╭─rhopp@dhcp-10-40-4-106 ~/go/src/github.com/openshift/installer ‹master*›
╰─$ oc get all
NAME READY STATUS RESTARTS AGE
pod/console-operator-79b8b8cb8d-cgpfn 1/1 Running 1 1h
pod/openshift-console-6ddfcc76b5-2kmpx 1/1 Running 0 1h
pod/openshift-console-6ddfcc76b5-sp5zm 1/1 Running 0 1h
pod/openshift-console-6ddfcc76b5-z52hq 1/1 Running 0 1h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/console ClusterIP 172.30.198.57 <none> 443/TCP 1h
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.apps/console-operator 1 1 1 1 1h
deployment.apps/openshift-console 3 3 3 3 1h
NAME DESIRED CURRENT READY AGE
replicaset.apps/console-operator-79b8b8cb8d 1 1 1 1h
replicaset.apps/openshift-console-6ddfcc76b5 3 3 3 1h
NAME HOST/PORT PATH SERVICES PORT TERMINATION WILDCARD
route.route.openshift.io/console console-openshift-console.apps.test1.tt.testing console https reencrypt/Redirect None
The pods are running and the service and route are up, but accessing https://console-openshift-console.apps.test1.tt.testing in a browser fails because the hostname cannot be resolved.
As part of the setup I configured dnsmasq as described in the libvirt guide.
For example, ping test1-api.tt.testing works as expected, but ping console-openshift-console.apps.test1.tt.testing throws:
ping: console-openshift-console.apps.test1.tt.testing: Name or service not known
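For reference, the workaround discussed later in this thread (and in the installer troubleshooting doc) is a host-side wildcard entry for the *.apps subdomain. A minimal sketch, assuming the cluster name test1, the base domain tt.testing, and an ingress/worker node at 192.168.126.51 (the exact IP depends on your cluster):
# /etc/NetworkManager/dnsmasq.d/openshift.conf
server=/tt.testing/192.168.126.1
address=/.apps.test1.tt.testing/192.168.126.51
followed by systemctl reload NetworkManager so dnsmasq picks up the change.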
What you expected to happen?
Web console to be accessible.
How to reproduce it (as minimally and precisely as possible)?
Follow https://github.com/openshift/installer/blob/master/docs/dev/libvirt-howto.md (my host machine is Fedora 29)
INSTALLATION OUTPUT
╭─rhopp@localhost ~/go/src/github.com/openshift/installer/bin ‹master*›
╰─$ ./openshift-install create cluster
? SSH Public Key [Use arrows to move, type to filter, ? for more help]
/home/rhopp/.ssh/gitlab.cee.key.pub
> <none>
? SSH Public Key [Use arrows to move, type to filter, ? for more help]
> /home/rhopp/.ssh/gitlab.cee.key.pub
<none>
? SSH Public Key /home/rhopp/.ssh/gitlab.cee.key.pub
? Platform [Use arrows to move, type to filter]
> aws
libvirt
openstack
? Platform [Use arrows to move, type to filter]
aws
> libvirt
openstack
? Platform libvirt
? Libvirt Connection URI [? for help] (qemu+tcp://192.168.122.1/system)
? Libvirt Connection URI qemu+tcp://192.168.122.1/system
? Base Domain [? for help] tt.testing
? Base Domain tt.testing
? Cluster Name [? for help] test1
? Cluster Name test1
? Pull Secret [? for help] ************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************* INFO Fetching OS image: redhat-coreos-maipo-47.247-qemu.qcow2.gz
INFO Creating cluster...
INFO Waiting up to 30m0s for the Kubernetes API...
INFO API v1.11.0+e3fa228 up
INFO Waiting up to 30m0s for the bootstrap-complete event...
INFO Destroying the bootstrap resources...
INFO Waiting up to 10m0s for the openshift-console route to be created...
INFO Install complete!
INFO Run 'export KUBECONFIG=/home/rhopp/go/src/github.com/openshift/installer/bin/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI.
INFO The cluster is ready when 'oc login -u kubeadmin -p 5tQwM-fXfkC-MIeAH-BmLeN' succeeds (wait a few minutes).
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.test1.tt.testing
INFO Login to the console with user: kubeadmin, password: 5tQwM-fXfkC-MIeAH-BmLeN
@crawford: Closing this issue.
In response to this:
Duplicate of #411.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@zeenix: Reopened this issue.
In response to this:
90b0d45 only documents a workaround, unfortunately.
/reopen
Has anyone had luck with the workaround posted in 90b0d45 recently? My libvirt cluster does not bring up the console operator with or without the documented workaround.
I tried setting the oauth hostname statically, without wildcards, in my dnsmasq config and I'm still getting oauth console errors.
See below.
dnsmasq config
~$ cat /etc/NetworkManager/dnsmasq.d/openshift.conf
server=/tt.testing/192.168.126.1
address=/.apps.tt.testing/192.168.126.51
address=/oauth-openshift.apps.test1.tt.testing/192.168.126.51
Sanity check that the hostname resolves to the proper node IP:
~$ ping oauth-openshift.apps.test1.tt.testing
PING oauth-openshift.apps.test1.tt.testing (192.168.126.51) 56(84) bytes of data.
64 bytes from 192.168.126.51 (192.168.126.51): icmp_seq=1 ttl=64 time=0.114 ms
64 bytes from 192.168.126.51 (192.168.126.51): icmp_seq=2 ttl=64 time=0.136 ms
Output of the crashed openshift-console pod's logs:
~$ oc logs -f console-67dbf7f789-k4gqg
2019/05/30 22:51:45 cmd/main: cookies are secure!
2019/05/30 22:51:45 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.test1.tt.testing/oauth/token failed: Head https://oauth-openshift.apps.test1.tt.testing: dial tcp: lookup oauth-openshift.apps.test1.tt.testing on 172.30.0.10:53: no such host
Am I missing something?
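For what it's worth, the failing lookup in that log is made against the in-cluster DNS (172.30.0.10), which on the libvirt network forwards to the libvirt network's own dnsmasq at 192.168.126.1 rather than to the NetworkManager dnsmasq on the host, and the libvirt instance is configured with local_only for the cluster domain, so host-side address= entries are never consulted for *.apps (this is the local_only discussion further down in the thread). A quick way to see the difference from the hypervisor, assuming the default 192.168.126.0/24 cluster network, the config above, and NetworkManager's dnsmasq listening on 127.0.0.1:
dig +short oauth-openshift.apps.test1.tt.testing @192.168.126.1   # libvirt network dnsmasq: no answer
dig +short oauth-openshift.apps.test1.tt.testing @127.0.0.1       # NetworkManager dnsmasq: 192.168.126.51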
Has anyone had luck with the workaround posted in 90b0d45 recently?
I just did and except for the usual timeout issue, the cluster came up all good afaict.
/priority important-longterm
@zeenix: GitHub didn't allow me to assign the following users: cfergeau.
Note that only openshift members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide
In response to this:
@cfergeau You said you had a WIP patch to fix this at the libvirt level. Do you think you'd be able to get that in in the near future?
/assign @cfergeau
Hi. I did the same but the error still persists.
Do I need to debug the installer, or is there any other pointer?
tail -f setup/.openshift_install.log
time="2019-08-10T04:47:10+08:00" level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (417 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-authentication-operator/authentication-operator" (382 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (421 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-image-registry/image-registry" (388 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (398 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (402 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (406 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (144 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-machine-api/machine-api-operator" (408 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (411 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (391 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (394 of 422): the server does not recognize this resource, check extension API servers"
time="2019-08-10T04:54:14+08:00" level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Could not update servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator" (417 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-authentication-operator/authentication-operator" (382 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-controller-manager-operator/openshift-controller-manager-operator" (421 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-image-registry/image-registry" (388 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-kube-apiserver-operator/kube-apiserver-operator" (398 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-kube-controller-manager-operator/kube-controller-manager-operator" (402 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-kube-scheduler-operator/kube-scheduler-operator" (406 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-machine-api/cluster-autoscaler-operator" (144 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-machine-api/machine-api-operator" (408 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-operator-lifecycle-manager/olm-operator" (411 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator" (391 of 422): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor "openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator" (394 of 422): the server does not recognize this resource, check extension API servers"
time="2019-08-10T04:56:51+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209"
time="2019-08-10T04:56:51+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: downloading update"
time="2019-08-10T04:56:56+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209"
time="2019-08-10T04:57:11+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: 19% complete"
time="2019-08-10T04:57:22+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: 82% complete"
time="2019-08-10T04:57:38+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: 95% complete"
time="2019-08-10T05:00:27+08:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.okd-2019-08-09-191209: 95% complete"
time="2019-08-10T05:01:40+08:00" level=fatal msg="failed to initialize the cluster: Working towards 4.2.0-0.okd-2019-08-09-191209: 95% complete"
@donghwicha Your issue is unrelated to this one.
Thanks. I fixed it already.
Has anyone had luck with the workaround posted in 90b0d45 recently?
I just did and except for the usual timeout issue, the cluster came up all good afaict.
I increased my timeouts to 90 minutes but still no luck even after applying this "workaround".
I was finally successful. I made a video to help anyone else having a tough time getting through the install process: https://youtu.be/4mFMqNExRWk
To fix this, we probably want/need to make use of the new libvirt mechanism to pass verbatim options to dnsmasq, but to be able to do that, we need terraform support.
Update: it turns out we can make use of the existing XSLT feature of the terraform libvirt provider for this.
@zeenix I saw the issue was closed on the terraform side, so should we add some template in the installer here, or some other settings?
@jichenjc I was looking into this last week but without success yet. I've also heard that someone is working on this at the ingress operator level, so I'll hold off my efforts for now.
Hi,
All my services are running ....
https://twitter.com/fabiosbano/status/1175842429641080832?s=09
Best Regards,
Fabio Sbano
Thanks, @ssbano. I saw the picture; what kind of changes make that happen? Thanks a lot.
You can set up DNS (bind, on bare metal) to resolve *.apps.${domain}, and I made the changes below:
[root@argon ~]# cat /etc/NetworkManager/dnsmasq.d/openshift.conf
server=/jaguar.fsbano.com/192.168.126.1
server=/apps.jaguar.fsbano.com/172.27.15.30
[root@argon ~]#
git diff
[root@argon installer]# git diff
diff --git a/cmd/openshift-install/create.go b/cmd/openshift-install/create.go
index 9021025b6..679649d1d 100644
--- a/cmd/openshift-install/create.go
+++ b/cmd/openshift-install/create.go
@@ -238,7 +238,7 @@ func waitForBootstrapComplete(ctx context.Context, config *rest.Config, director
discovery := client.Discovery()
- apiTimeout := 30 * time.Minute
+ apiTimeout := 60 * time.Minute
logrus.Infof("Waiting up to %v for the Kubernetes API at %s...", apiTimeout, config.Host)
apiContext, cancel := context.WithTimeout(ctx, apiTimeout)
defer cancel()
@@ -279,7 +279,7 @@ func waitForBootstrapComplete(ctx context.Context, config *rest.Config, director
// and waits for the bootstrap configmap to report that bootstrapping has
// completed.
func waitForBootstrapConfigMap(ctx context.Context, client *kubernetes.Clientset) error {
- timeout := 30 * time.Minute
+ timeout := 60 * time.Minute
logrus.Infof("Waiting up to %v for bootstrapping to complete...", timeout)
waitCtx, cancel := context.WithTimeout(ctx, timeout)
@@ -317,7 +317,7 @@ func waitForBootstrapConfigMap(ctx context.Context, client *kubernetes.Clientset
// waitForInitializedCluster watches the ClusterVersion waiting for confirmation
// that the cluster has been initialized.
func waitForInitializedCluster(ctx context.Context, config *rest.Config) error {
- timeout := 30 * time.Minute
+ timeout := 60 * time.Minute
logrus.Infof("Waiting up to %v for the cluster at %s to initialize...", timeout, config.Host)
cc, err := configclient.NewForConfig(config)
if err != nil {
diff --git a/data/data/libvirt/main.tf b/data/data/libvirt/main.tf
index 9ba88c9cf..152c78dd5 100644
--- a/data/data/libvirt/main.tf
+++ b/data/data/libvirt/main.tf
@@ -54,6 +54,11 @@ resource "libvirt_network" "net" {
dns {
local_only = true
+ forwarders {
+ address = "172.27.15.30"
+ domain = "apps.${var.cluster_domain}"
+ }
+
dynamic "srvs" {
for_each = data.libvirt_network_dns_srv_template.etcd_cluster.*.rendered
content {
diff --git a/data/data/libvirt/variables-libvirt.tf b/data/data/libvirt/variables-libvirt.tf
index 53cf68bae..79d1018e2 100644
--- a/data/data/libvirt/variables-libvirt.tf
+++ b/data/data/libvirt/variables-libvirt.tf
@@ -32,7 +32,7 @@ variable "libvirt_master_ips" {
variable "libvirt_master_memory" {
type = string
description = "RAM in MiB allocated to masters"
- default = "6144"
+ default = "16384"
}
# At some point this one is likely to default to the number
diff --git a/pkg/asset/machines/libvirt/machines.go b/pkg/asset/machines/libvirt/machines.go
index 2ab6d9aa2..08847ab95 100644
--- a/pkg/asset/machines/libvirt/machines.go
+++ b/pkg/asset/machines/libvirt/machines.go
@@ -63,7 +63,7 @@ func provider(clusterID string, networkInterfaceAddress string, platform *libvir
APIVersion: "libvirtproviderconfig.openshift.io/v1beta1",
Kind: "LibvirtMachineProviderConfig",
},
- DomainMemory: 7168,
+ DomainMemory: 16384,
DomainVcpu: 4,
Ignition: &libvirtprovider.Ignition{
UserDataSecret: userDataSecret,
[root@argon installer]#
@ssbano
thanks a lot!
I actually tried the /etc/NetworkManager/dnsmasq.d/openshift.conf change and it seems that works for me (at least the console starts up).
Can I ask what the purpose of the following lines is? Thanks
+ forwarders {
+ address = "172.27.15.30"
+ domain = "apps.${var.cluster_domain}"
+ }
+
I am using named for wildcard name resolution instead of dnsmasq.
The IP address '172.27.15.30' is my physical machine running the bind service.
Best regards,
Fábio Sbano
ok, thanks for the info ~
Similar issue signature here on 4.2. Interestingly, the exact same configs (I am using Ansible to set it up) worked only the first time, and now the install constantly fails at almost the final stage with Authentication degraded. I've spent the whole day trying to find out what could cause that.
In my Bind zone ocp.example.com.zone I have *.apps IN A 192.168.1.254, where .254 is an HAProxy LB with server infnod-0 infnod-0.ocp.example.com:443 check. So basically *.apps.ocp.example.com points to the source-balanced infra nodes.
frontend ocp-kubernetes-api-server
mode tcp
option tcplog
bind api.ocp.example.com:6443
default_backend ocp-kubernetes-api-server
backend ocp-kubernetes-api-server
balance source
mode tcp
server bootstrap-0 bootstrap-0.ocp.example.com:6443 check
server master-0 master-0.ocp.example.com:6443 check
server master-1 master-1.ocp.example.com:6443 check
server master-2 master-2.ocp.example.com:6443 check
frontend ocp-machine-config-server
bind api.ocp.example.com:22623
default_backend ocp-machine-config-server
mode tcp
option tcplog
backend ocp-machine-config-server
balance source
mode tcp
server bootstrap-0 bootstrap-0.ocp.example.com:22623 check
server master-0 master-0.ocp.example.com:22623 check
server master-1 master-1.ocp.example.com:22623 check
server master-2 master-2.ocp.example.com:22623 check
frontend ocp-router-http
bind apps.ocp.example.com:80
default_backend ocp-router-http
mode tcp
option tcplog
backend ocp-router-http
balance source
mode tcp
server infnod-0 infnod-0.ocp.example.com:80 check
server infnod-1 infnod-1.ocp.example.com:80 check
frontend ocp-router-https
bind apps.ocp.example.com:443
default_backend ocp-router-https
mode tcp
option tcplog
backend ocp-router-https
balance source
mode tcp
server infnod-0 infnod-0.ocp.example.com:443 check
server infnod-1 infnod-1.ocp.example.com:443 check
It doesn't matter if I disable the bootstrap rules after bootstrapping is done.
E1027 16:04:32.356766 1 controller.go:129] {AuthenticationOperator2 AuthenticationOperator2} failed with: failed handling the route: route is not available at canonical host oauth-openshift.apps.ocp.example.com: []
If I ssh core@master-0.ocp.example.com and ping/dig oauth-openshift.apps.ocp.example.com, I get the IP of the LB node (.254).
I don't know whether the infra nodes should be in this state at this point.
Before all this, I had an issue with SELinux on my LB machine because I was missing:
semanage port -a 22623 -t http_port_t -p tcp
semanage port -a 6443 -t http_port_t -p tcp
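A quick way to confirm those ports were actually added to the http_port_t type:
semanage port -l | grep -w http_port_t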
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Is there a way for @openshift-bot to give this immunity to becoming stale?
/remove-lifecycle rotten
This is still something we want to fix. There are just a surprisingly large number of pieces that need to fall into place in the background before we can tackle this.
I might ask a stupid question, but I'll ask anyway.
What is the reason for the option local_only = true, when local_only = false would fix this issue?
local_only - (Optional) true/false: true means 'do not forward unresolved requests for this domain to the upstream DNS server'.
I ran the following test:
sed -i 's/local_only = true/local_only = false/' /root/go/src/github.com/openshift/installer/data/data/libvirt/main.tf
TAGS=libvirt hack/build.sh
mkdir /root/bin
cp -rf /root/go/src/github.com/openshift/installer/bin/openshift-install /root/bin/
yum install dnsmasq
echo -e "[main]\ndns=dnsmasq" | sudo tee /etc/NetworkManager/conf.d/openshift.conf
echo listen-address=127.0.0.1 > /etc/NetworkManager/dnsmasq.d/openshift.conf
echo bind-interfaces >> /etc/NetworkManager/dnsmasq.d/openshift.conf
echo server=8.8.8.8 >> /etc/NetworkManager/dnsmasq.d/openshift.conf
echo address=/apps.ocp.openshift.local/192.168.126.1 >> /etc/NetworkManager/dnsmasq.d/openshift.conf
systemctl reload NetworkManager
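At this point a quick sanity check of the host-side resolver (a sketch using the wildcard domain from the address= line above; it should return 192.168.126.1, which is where the load balancer below listens):
dig +short console-openshift-console.apps.ocp.openshift.local @127.0.0.1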
3x master
3x workers
and using a Container Loadbalancer
/usr/bin/podman run -d --name loadbalancer --net host \
  -e API="bootstrap=192.168.126.10:6443,master-0=192.168.126.11:6443,master-1=192.168.126.12:6443,master-2=192.168.126.13:6443" \
  -e API_LISTEN="0.0.0.0:6443" \
  -e INGRESS_HTTP="worker-0=192.168.126.51:80,worker-1=192.168.126.52:80,worker-2=192.168.126.53:80" \
  -e INGRESS_HTTP_LISTEN="0.0.0.0:80" \
  -e INGRESS_HTTPS="worker-0=192.168.126.51:443,worker-1=192.168.126.52:443,worker-2=192.168.126.53:443" \
  -e INGRESS_HTTPS_LISTEN="0.0.0.0:443" \
  -e MACHINE_CONFIG_SERVER="bootstrap=192.168.126.10:22623,master-0=192.168.126.10:22623,master-1=192.168.126.11:22623,master-2=192.168.126.12:22623" \
  -e MACHINE_CONFIG_SERVER_LISTEN="127.0.0.1:22623" \
  quay.io/redhat-emea-ssa-team/openshift-4-loadbalancer
And the installation went well.
I used to solve this by changing the APPS URL to apps.$basedomain instead of apps.$clustername.$basedomain but, since I want to use the default APPS URL, I've also solved it by modifying data/data/libvirt/main.tf. Instead of changing local_only, I added a forwarders entry just for the apps.$clustername.$basedomain domain pointing at the libvirt network gateway on the KVM host, where I have this configuration in dnsmasq managed by NetworkManager:
dns {
  local_only = true
  forwarders {
    address = "192.168.122.1"
    domain  = "apps.$clustername.$basedomain"
  }
}
This is the KVM dnsmasq config:
server=/$basedomain/192.168.126.1
address=/.apps.$clustername.$basedomain/192.168.126.1
Doing this, the libvirt dnsmasq keeps managing everything except the apps URLs (which it wouldn't resolve because of this issue); those are forwarded to the KVM host's dnsmasq, which actually works.
You can check my playbook that configures the KVM host here:
https://github.com/luisarizmendi/ocp-libvirt-ipi-role/blob/master/tasks/kvm_deploy.yml
And the playbook that changes the data/data/libvirt/main.tf file here:
https://github.com/luisarizmendi/ocp-libvirt-ipi-role/blob/master/tasks/ocp_deploy.yml
https://gitlab.com/libvirt/libvirt/-/commit/fb9f6ce625322d10b2e2a7c3ce4faab780b97e8d might be a way to add the needed options to the libvirt dnsmasq instance, which would allow all the cluster-related name resolution to happen on 192.168.126.1 rather than having to go through a second dnsmasq instance managed by NetworkManager.
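For reference, with a libvirt new enough to support dnsmasq option passthrough, the idea would look roughly like this in the network XML (a sketch only, reusing the test1/tt.testing names and the .51 ingress node from earlier in the thread; applied with virsh net-edit and a restart of the network):
<network xmlns:dnsmasq='http://libvirt.org/schemas/network/dnsmasq/1.0'>
  ...
  <dnsmasq:options>
    <!-- wildcard ingress record served directly by libvirt's own dnsmasq -->
    <dnsmasq:option value='address=/.apps.test1.tt.testing/192.168.126.51'/>
  </dnsmasq:options>
</network>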
@cfergeau totally agreed. I did some tests on RHEL/CentOS 8 with libvirt 5.6, where libvirt manages all the DNS entries, including *.apps.
https://github.com/RedHat-EMEA-SSA-Team/labs/tree/master/disk-encryption#creating-libvirt-network
Best Regards
To make the feature proposed by @ralvares work when using the Terraform provider for libvirt, the following XSLT transformation can be applied https://github.com/samuelvl/ocp4-disconnected-lab/blob/master/src/dns/libvirt-dns.xml.tpl
resource "libvirt_network" "openshift" {
...
xml {
xslt = data.template_file.openshift_libvirt_dns.rendered
}
}
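One possible shape for such a transform (a sketch only, not the linked template; it is an identity copy that appends a dnsmasq passthrough block, reusing the hypothetical wildcard entry from above):
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:dnsmasq="http://libvirt.org/schemas/network/dnsmasq/1.0">
  <!-- identity transform: copy the network XML generated by terraform unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- append a dnsmasq option passthrough block to the <network> element -->
  <xsl:template match="network">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
      <dnsmasq:options>
        <dnsmasq:option value="address=/.apps.test1.tt.testing/192.168.126.51"/>
      </dnsmasq:options>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>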
Here's a workaround: #1648 (comment)
Last week, while trying to do some basic verification, I ran into an issue where the workaround listed in the installer troubleshooting doc wasn't working. We figured out it was because I had spun up a cluster with three workers, but the ingress controller has 2 set in its replicaset, so neither of those pods landed on the .51 worker and we saw the same symptoms as if no workaround had been applied. It doesn't look like there's a way to do wildcards and have multiple IPs for a host entry; dnsmasq seems to take the last entry in a file as the IP instead of doing any kind of round-robin. Any suggestions? Or do we just need to edit the manifest for the ingress operator to create 3 replicas?
@clnperez I'm running into the same issue. Did you manage to find a solution?
@marshallford no, nothing other than spinning up that 3rd replica for the ingress.
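For anyone hitting the same replica/worker mismatch, scaling the default IngressController so a router pod lands on every worker is one option (adjust the count to your worker count):
oc patch ingresscontroller/default -n openshift-ingress-operator --type merge -p '{"spec":{"replicas": 3}}'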
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close