ocp-power-automation/ocp4-upi-powervm

Removal of bootstrap node with NFS for shared storage hangs or has other issues

robgjertsen1 opened this issue · 1 comments

I've tried this three times and it fails each time. Initially I installed a cluster and ran a workload on it without problems. I only saw issues once I tried to remove the bootstrap node, at which point accessing PVCs started failing. The bootstrap removal hung (the node was removed from PowerVC, but the run got stuck later on). There was also an odd NFS problem where I/Os hung even though there was no obvious issue with the physical storage. Then I tried removing the bootstrap node immediately after recreating the cluster. This also resulted in issues: in one case the NFS filesystem wasn't mounted, and in another the terraform execution was again stuck after removing the bootstrap node from PowerVC (although the NFS mount was OK here).

Below are some details from the last attempt. We are stuck in the Gathering Facts task of the ansible ocp4-helpernode playbook.

Output from terraform:

======================================================================================================
$ terraform apply -var-file var.tfvars
module.workernodes.data.ignition_file.w_hostname[0]: Reading...
module.bootstrapnode.data.ignition_file.b_hostname: Reading...
module.masternodes.data.ignition_file.m_hostname[0]: Reading...
module.bootstrapnode.data.ignition_file.b_hostname: Read complete after 0s [id=1
ec8928da9e89f9b35deb26dd484665fda91d99d73e31330dce71edf3a4e19cc]
module.masternodes.data.ignition_file.m_hostname[0]: Read complete after 0s [id=
7551bfa9e87523c711bf18607b8af5ccfee1657ea6c4817bbc3dd2186602f590]
module.workernodes.data.ignition_file.w_hostname[0]: Read complete after 0s [id=
28b9dcc333049039879c9c1e94f95816f0341945047e8ae59674e1233f72be83]
module.workernodes.data.openstack_compute_flavor_v2.worker: Reading...
module.masternodes.data.openstack_compute_flavor_v2.master: Reading...
module.bootstrapnode.data.openstack_compute_flavor_v2.bootstrap: Reading...
module.bastion.openstack_compute_keypair_v2.key-pair[0]: Refreshing state... [id
=merlin2-keypair]
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Refreshing st
ate... [id=317b2360-639d-4d8a-8b34-a58f1bb19ee9]
module.bastion.data.openstack_compute_flavor_v2.bastion: Reading...
module.network.data.openstack_networking_network_v2.network: Reading...
module.network.data.openstack_networking_network_v2.network: Read complete after
2s [id=f5e55ae3-c790-4a29-91e1-ce04a1acfc69]
module.network.data.openstack_networking_subnet_v2.subnet: Reading...
module.workernodes.data.openstack_compute_flavor_v2.worker: Read complete after
2s [id=1e5b0eed-6681-4305-8bc9-e20afb9f7cca]
module.bootstrapnode.data.openstack_compute_flavor_v2.bootstrap: Read complete a
fter 2s [id=874b188b-074a-4042-b0c8-3a22f04f8302]
module.masternodes.data.openstack_compute_flavor_v2.master: Read complete after
2s [id=d364331a-9f24-4784-bced-3765e0c097ed]
module.bastion.data.openstack_compute_flavor_v2.bastion: Read complete after 2s
[id=874b188b-074a-4042-b0c8-3a22f04f8302]
module.network.data.openstack_networking_subnet_v2.subnet: Read complete after 0
s [id=63011d28-987a-4ae1-a094-595f2e513a23]
module.network.openstack_networking_port_v2.bastion_port[0]: Refreshing state...
[id=91a5c711-f109-4ec0-91e7-86cd821233cc]
module.network.openstack_networking_port_v2.bootstrap_port[0]: Refreshing state.
.. [id=7072aba5-ac95-4b36-994a-1855f2624b55]
module.bastion.openstack_compute_instance_v2.bastion[0]: Refreshing state... [id
=f874bcaf-e8d5-46f6-8088-652ee3b9930a]
module.network.openstack_networking_port_v2.master_port[0]: Refreshing state...
[id=6d0e1c9a-aa11-48c3-80cd-e22c2cbe8abe]
module.network.openstack_networking_port_v2.worker_port[0]: Refreshing state...
[id=3a683058-4670-4f2a-a701-fc21e56142de]
module.bastion.null_resource.bastion_init[0]: Refreshing state... [id=5535521664
652524244]
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Refreshin
g state... [id=f874bcaf-e8d5-46f6-8088-652ee3b9930a/317b2360-639d-4d8a-8b34-a58f
1bb19ee9]
module.bastion.null_resource.bastion_register[0]: Refreshing state... [id=390765
5651701511596]
module.bastion.null_resource.enable_repos[0]: Refreshing state... [id=8446503032
307780766]
module.bastion.null_resource.bastion_packages[0]: Refreshing state... [id=756303
5921889930989]
module.bastion.null_resource.setup_nfs_disk[0]: Refreshing state... [id=57837008
01307001475]
module.workernodes.data.ignition_config.worker[0]: Reading...
module.bootstrapnode.data.ignition_config.bootstrap: Reading...
module.workernodes.data.ignition_config.worker[0]: Read complete after 0s [id=85
d98bf1d766507417ab5b578be1abe6f3e6c0a80e57a931862b80f5ff8b4153]
module.masternodes.data.ignition_config.master[0]: Reading...
module.helpernode.null_resource.config: Refreshing state... [id=3876494058890088
587]
module.masternodes.data.ignition_config.master[0]: Read complete after 0s [id=7a
035ac3f88d415956417f73f6ecd986a9d339cdbbea088f5332e0cd8a46de94]
module.bootstrapnode.data.ignition_config.bootstrap: Read complete after 0s [id=
87f77fe2ea79f17615628f4222c5676d8c8062883faa6236a4ef9d6087f86729]
module.installconfig.null_resource.pre_install[0]: Refreshing state... [id=14174
00108243665749]
module.installconfig.null_resource.install_config: Refreshing state... [id=46832
85385320449241]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Refreshing stat
e... [id=2c289dad-9552-4037-8105-f798406ff623]
module.bootstrapconfig.null_resource.bootstrap_config: Refreshing state... [id=6
822641950134297211]
module.masternodes.openstack_compute_instance_v2.master[0]: Refreshing state...
[id=0b820558-2077-40ee-81d9-811aa7dbc6d0]
module.bootstrapcomplete.null_resource.bootstrap_complete: Refreshing state... [
id=285966427274519477]
module.workernodes.openstack_compute_instance_v2.worker[0]: Refreshing state...
[id=78da4c4f-d882-49c0-9e6b-94100492be63]
module.workernodes.null_resource.remove_worker[0]: Refreshing state... [id=17321
02747293394534]
module.install.null_resource.install: Refreshing state... [id=287553626659034692
8]
module.install.null_resource.upgrade[0]: Refreshing state... [id=871900455648840
9504]

Terraform used the selected providers to generate the following execution plan.
Resource actions are indicated with the following symbols:

  - destroy
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # module.bastion.openstack_blockstorage_volume_v3.storage_volume[0] must be replaced
-/+ resource "openstack_blockstorage_volume_v3" "storage_volume" {
      ~ attachment        = [
          - {
              - device      = "/dev/sdb"
              - id          = "317b2360-639d-4d8a-8b34-a58f1bb19ee9"
              - instance_id = "f874bcaf-e8d5-46f6-8088-652ee3b9930a"
            },
        ] -> (known after apply)
      ~ availability_zone = "nova" -> (known after apply)
      ~ id                = "317b2360-639d-4d8a-8b34-a58f1bb19ee9" -> (known after apply)
      ~ metadata          = {
          - "attached_mode" = "rw"
          - "volume_wwn"    = "60050768028105F5D0000000000002D4"
        } -> (known after apply)
        name              = "merlin2-nfs-storage-vol"
      + region            = (known after apply)
      ~ volume_type       = "v7kamp.rch.stglabs.ibm.com base template" -> "63272fa4-2a99-4a94-ab1e-2a12fb64b1f8" # forces replacement
        # (1 unchanged attribute hidden)
    }

  # module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0] must be replaced
-/+ resource "openstack_compute_volume_attach_v2" "storage_v_attach" {
      ~ device    = "/dev/sdb" -> (known after apply)
      ~ id        = "f874bcaf-e8d5-46f6-8088-652ee3b9930a/317b2360-639d-4d8a-8b34-a58f1bb19ee9" -> (known after apply)
      + region    = (known after apply)
      ~ volume_id = "317b2360-639d-4d8a-8b34-a58f1bb19ee9" # forces replacement -> (known after apply)
        # (1 unchanged attribute hidden)
    }

  # module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0] will be destroyed
  # (because index [0] is out of range for count)
  - resource "openstack_compute_instance_v2" "bootstrap" {
      - access_ip_v4        = "9.5.36.167" -> null
      - all_metadata        = {
          - "enforce_affinity_check" = "false"
          - "move_pin_vm"            = "false"
          - "original_host"          = "837542A_10C5EDW"
        } -> null
      - all_tags            = [] -> null
      - availability_zone   = "Default Group" -> null
      - created             = "2023-05-01 21:09:11 +0000 UTC" -> null
      - flavor_id           = "874b188b-074a-4042-b0c8-3a22f04f8302" -> null
      - flavor_name         = "bastion_bootstrap" -> null
      - force_delete        = false -> null
      - id                  = "2c289dad-9552-4037-8105-f798406ff623" -> null
      - image_id            = "a518c74e-cd80-4c67-8724-15b2720b2108" -> null
      - image_name          = "rhcos-new" -> null
      - name                = "merlin2-bootstrap" -> null
      - power_state         = "active" -> null
      - security_groups     = [] -> null
      - stop_before_destroy = false -> null
      - updated             = "2023-05-01 22:04:38 +0000 UTC" -> null
      - user_data           = "eb7b092f153c6094e6202339c2b0ef36dbc518fd" -> null

      - network {
          - access_network = false -> null
          - fixed_ip_v4    = "9.5.36.167" -> null
          - mac            = "fa:16:3e:5c:1d:b7" -> null
          - name           = "merlin2" -> null
          - port           = "7072aba5-ac95-4b36-994a-1855f2624b55" -> null
          - uuid           = "f5e55ae3-c790-4a29-91e1-ce04a1acfc69" -> null
        }
    }

  # module.helpernode.null_resource.config must be replaced
-/+ resource "null_resource" "config" {
      ~ id       = "3876494058890088587" -> (known after apply)
      ~ triggers = { # forces replacement
          ~ "bootstrap_count" = "1" -> "0"
            # (2 unchanged elements hidden)
        }
    }

  # module.network.openstack_networking_port_v2.bootstrap_port[0] will be destroyed
  # (because index [0] is out of range for count)
  - resource "openstack_networking_port_v2" "bootstrap_port" {
      - admin_state_up         = true -> null
      - all_fixed_ips          = [
          - "9.5.36.167",
        ] -> null
      - all_security_group_ids = [] -> null
      - all_tags               = [] -> null
      - device_id              = "2c289dad-9552-4037-8105-f798406ff623" -> null
      - device_owner           = "compute:Default Group" -> null
      - dns_assignment         = [] -> null
      - id                     = "7072aba5-ac95-4b36-994a-1855f2624b55" -> null
      - mac_address            = "fa:16:3e:5c:1d:b7" -> null
      - name                   = "merlin2-bootstrap-port" -> null
      - network_id             = "f5e55ae3-c790-4a29-91e1-ce04a1acfc69" -> null
      - port_security_enabled  = false -> null
      - tags                   = [] -> null
      - tenant_id              = "e4af56f8139e4418abcb29c723bf15a9" -> null

      - binding {
          - host_id     = "837542A_10C5EDW" -> null
          - profile     = jsonencode({})
          - vif_details = {} -> null
          - vif_type    = "binding_failed" -> null
          - vnic_type   = "normal" -> null
        }
      - fixed_ip {
          - ip_address = "9.5.36.167" -> null
          - subnet_id  = "63011d28-987a-4ae1-a094-595f2e513a23" -> null
        }
    }

Plan: 3 to add, 0 to change, 5 to destroy.

Changes to Outputs:
~ bootstrap_ip = "9.5.36.167" -> ""

Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.

Enter a value: yes

module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Destroying... [
id=2c289dad-9552-4037-8105-f798406ff623]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Still destroyin
g... [id=2c289dad-9552-4037-8105-f798406ff623, 10s elapsed]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Still destroyin
g... [id=2c289dad-9552-4037-8105-f798406ff623, 20s elapsed]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Still destroyin
g... [id=2c289dad-9552-4037-8105-f798406ff623, 30s elapsed]
module.bootstrapnode.openstack_compute_instance_v2.bootstrap[0]: Destruction com
plete after 34s
module.helpernode.null_resource.config: Destroying... [id=3876494058890088587]
module.helpernode.null_resource.config: Destruction complete after 0s
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Destroyin
g... [id=f874bcaf-e8d5-46f6-8088-652ee3b9930a/317b2360-639d-4d8a-8b34-a58f1bb19e
e9]
module.network.openstack_networking_port_v2.bootstrap_port[0]: Destroying... [id
=7072aba5-ac95-4b36-994a-1855f2624b55]
module.network.openstack_networking_port_v2.bootstrap_port[0]: Destruction compl
ete after 7s
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Destructi
on complete after 9s
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Destroying...
[id=317b2360-639d-4d8a-8b34-a58f1bb19ee9]
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Still destroy
ing... [id=317b2360-639d-4d8a-8b34-a58f1bb19ee9, 10s elapsed]
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Destruction c
omplete after 11s
module.helpernode.null_resource.config: Creating...
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Creating...
module.helpernode.null_resource.config: Provisioning with 'remote-exec'...
module.helpernode.null_resource.config (remote-exec): Connecting to remote host
via SSH...
module.helpernode.null_resource.config (remote-exec): Host: 9.5.36.166
module.helpernode.null_resource.config (remote-exec): User: root
module.helpernode.null_resource.config (remote-exec): Password: false
module.helpernode.null_resource.config (remote-exec): Private key: true
module.helpernode.null_resource.config (remote-exec): Certificate: false
module.helpernode.null_resource.config (remote-exec): SSH Agent: false
module.helpernode.null_resource.config (remote-exec): Checking Host Key: false
module.helpernode.null_resource.config (remote-exec): Target Platform: unix
module.helpernode.null_resource.config (remote-exec): Connected!
module.helpernode.null_resource.config (remote-exec): Cloning into ocp4-helperno
de...
module.helpernode.null_resource.config (remote-exec): Note: switching to 'adb110
2f64b2f25a8a1b44a96c414f293d72d3fc'.

module.helpernode.null_resource.config (remote-exec): You are in 'detached HEAD'
state. You can look around, make experimental
module.helpernode.null_resource.config (remote-exec): changes and commit them, a
nd you can discard any commits you make in this
module.helpernode.null_resource.config (remote-exec): state without impacting an
y branches by switching back to a branch.

module.helpernode.null_resource.config (remote-exec): If you want to create a ne
w branch to retain commits you create, you may
module.helpernode.null_resource.config (remote-exec): do so (now or later) by us
ing -c with the switch command. Example:

module.helpernode.null_resource.config (remote-exec): git switch -c

module.helpernode.null_resource.config (remote-exec): Or undo this operation wit
h:

module.helpernode.null_resource.config (remote-exec): git switch -

module.helpernode.null_resource.config (remote-exec): Turn off this advice by se
tting config variable advice.detachedHead to false

module.helpernode.null_resource.config (remote-exec): HEAD is now at adb1102 Mer
ge pull request #305 from redhat-cop/devel
module.helpernode.null_resource.config: Provisioning with 'file'...
module.helpernode.null_resource.config: Still creating... [10s elapsed]
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Still creatin
g... [10s elapsed]
module.helpernode.null_resource.config: Provisioning with 'file'...
module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]: Creation comp
lete after 12s [id=35ba1876-52b6-4769-9950-eaf3be077eaa]
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Creating.
..
module.helpernode.null_resource.config: Provisioning with 'file'...
module.helpernode.null_resource.config: Provisioning with 'remote-exec'...
module.helpernode.null_resource.config (remote-exec): Connecting to remote host
via SSH...
module.helpernode.null_resource.config (remote-exec): Host: 9.5.36.166
module.helpernode.null_resource.config (remote-exec): User: root
module.helpernode.null_resource.config (remote-exec): Password: false
module.helpernode.null_resource.config (remote-exec): Private key: true
module.helpernode.null_resource.config (remote-exec): Certificate: false
module.helpernode.null_resource.config (remote-exec): SSH Agent: false
module.helpernode.null_resource.config (remote-exec): Checking Host Key: false
module.helpernode.null_resource.config (remote-exec): Target Platform: unix
module.helpernode.null_resource.config (remote-exec): Connected!
module.bastion.openstack_compute_volume_attach_v2.storage_v_attach[0]: Creation
complete after 7s [id=f874bcaf-e8d5-46f6-8088-652ee3b9930a/35ba1876-52b6-4769-99
50-eaf3be077eaa]
module.helpernode.null_resource.config (remote-exec): Running ocp4-helpernode pl
aybook...
module.helpernode.null_resource.config: Still creating... [20s elapsed]
module.helpernode.null_resource.config (remote-exec): Using /root/ocp4-helpernod
e/ansible.cfg as config file

module.helpernode.null_resource.config (remote-exec): PLAY [all] ***************


module.helpernode.null_resource.config (remote-exec): TASK [Gathering Facts] ***


module.helpernode.null_resource.config: Still creating... [30s elapsed]
...

module.helpernode.null_resource.config: Still creating... [17h6m25s elapsed]

======================================================================================================

Initiating node info:

$ ps -ef | grep terraform
gjertsen 3410758 12015 0 May01 pts/1 00:08:32 terraform apply -var-file va
r.tfvars
gjertsen 3411154 3410758 0 May01 pts/1 00:00:03 .terraform/providers/registr
y.terraform.io/hashicorp/null/3.2.1/linux_amd64/terraform-provider-null_v3.2.1_x
5

======================================================================================================

bastion node state:

ps -ef | grep ansible

root 67764 67738 7 May01 pts/1 01:21:07 /usr/libexec/platform-python /usr/bin/ansible-playbook -i inventory -e @helpernode_vars.yaml tasks/main.yml -v --become
root 67771 67764 0 May01 pts/1 00:00:00 /usr/libexec/platform-python /usr/bin/ansible-playbook -i inventory -e @helpernode_vars.yaml tasks/main.yml -v --become
root 67782 1 0 May01 ? 00:00:00 ssh: /root/.ansible/cp/08610c3669 [mux]
root 67890 67771 0 May01 pts/1 00:00:00 ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="root" -o ConnectTimeout=10 -o ControlPath=/root/.ansible/cp/08610c3669 -tt 9.5.36.166 /bin/sh -c '/usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1682979218.3504968-67771-94255852006671/AnsiballZ_setup.py && sleep 0'
root 67891 67783 0 May01 pts/3 00:00:00 /bin/sh -c /usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1682979218.3504968-67771-94255852006671/AnsiballZ_setup.py && sleep 0
root 67912 67891 0 May01 pts/3 00:00:04 /usr/libexec/platform-python /root/.ansible/tmp/ansible-tmp-1682979218.3504968-67771-94255852006671/AnsiballZ_setup.py

NFS mount looks OK

exportfs

/export

ls -al /export

total 0
drwxrwxrwx. 3 nobody nobody 92 May 1 17:41 .
dr-xr-xr-x. 19 root root 259 May 1 17:06 ..
drwxrwxrwx. 2 nobody nobody 6 May 1 17:41 openshift-image-registry-registry-pvc-pvc-5b20c6ca-b184-41eb-b145-c5253c26015a

Not sure how I missed this issue. I suggest using markdown code formatting when pasting console logs.

Coming back to the root cause: the main line in the plan that shows why the NFS disk (module.bastion.openstack_blockstorage_volume_v3.storage_volume[0]) is being recreated is:

~ volume_type = "v7kamp.rch.stglabs.ibm.com base template" -> "63272fa4-2a99-4a94-ab1e-2a12fb64b1f8" # forces replacement

It seems the terraform provider for openstack is now returning the storage template ID, rather than its name, when querying the service. Terraform therefore detects a change in the template, as shown above, and that change forces the volume to be replaced. We have not used this feature recently, but it seems something has changed so that only the ID will work.
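If you need to look up the ID for a given storage template name, something like the following might help. This is only a sketch: it assumes PowerVC storage templates show up as volume types in the `openstack` CLI, that the CLI is configured against your PowerVC, and that the output has the ID in the first column. `template_id` is a hypothetical helper name, not part of this repo.

```shell
# Hypothetical helper: print the ID whose name exactly matches $1.
# Reads "<ID> <Name...>" lines on stdin (names may contain spaces).
template_id() {
  awk -v name="$1" '{
    id = $1
    sub(/^[^ ]+ /, "")        # drop the ID column, leaving only the name
    if ($0 == name) print id
  }'
}

# Usage sketch (assumes a configured openstack CLI; not run here):
#   openstack volume type list -f value -c ID -c Name \
#     | template_id "v7kamp.rch.stglabs.ibm.com base template"
```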

As a workaround, please set the variable volume_storage_template to "63272fa4-2a99-4a94-ab1e-2a12fb64b1f8" and run apply again. Terraform should then no longer detect a change that forces replacement of the NFS volume.
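Before answering "yes" to the apply, it may be worth confirming that the plan no longer wants to replace the NFS volume. A minimal sketch, assuming you save the plan text to a file first; `nfs_volume_replaced` is a hypothetical helper name, and the pattern is based only on the plan output pasted in this issue:

```shell
# Hypothetical helper: succeeds (exit 0) if the saved plan text in $1
# still says the NFS storage volume must be replaced.
nfs_volume_replaced() {
  grep -q 'openstack_blockstorage_volume_v3\.storage_volume.*must be replaced' "$1"
}

# Usage sketch (needs terraform and your var file; not run here):
#   terraform plan -no-color -var-file var.tfvars \
#     -var 'volume_storage_template=63272fa4-2a99-4a94-ab1e-2a12fb64b1f8' > plan.txt
#   if nfs_volume_replaced plan.txt; then
#     echo "NFS volume would still be recreated, do not apply"
#   fi
```

Note that module.helpernode.null_resource.config will still show as "must be replaced" when bootstrap_count goes from 1 to 0, so the check deliberately matches only the storage volume.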