canonical/microcloud

Error at microcloud init

tmihoc opened this issue · 11 comments

tmihoc commented

I was following the MicroCloud tutorial but got stuck at the microcloud init step.

Details:

I ran all the commands inside a Multipass VM (multipass launch --name test-vm --cpus 8 --disk 100G --memory 8G). Other than that, I followed all the instructions in the tutorial exactly. The error happened at the end of the microcloud init step, after I'd answered all the prompts, and looked like this:

...
Initializing a new cluster
 Local MicroCloud is ready
 Local LXD is ready
 Local MicroOVN is ready
 Local MicroCeph is ready
Awaiting cluster formation ...
 Peer "micro4" has joined the cluster
Error: write vsock vm(4294967295):928159682->vm(4):8443: broken pipe
ubuntu@test-vm:~$ 

roosterfish commented

Hi @tmihoc, thanks for reporting this. Do you spawn the additional four machines within the Multipass VM using nested virtualization? If that is the case, please have a look at the hardware requirements for MicroCloud: https://canonical-microcloud.readthedocs-hosted.com/en/latest/reference/#hardware-requirements. A single machine with 8G of RAM is not sufficient.
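
(For reference: a quick way to check from inside the outer VM whether nested KVM is available at all; these are generic Ubuntu commands, not part of the tutorial.)

# Inside test-vm: /dev/kvm must exist for the micro* VMs to be able to run
ls -l /dev/kvm
# kvm-ok (from the cpu-checker package) gives a plain yes/no answer
sudo apt install -y cpu-checker && sudo kvm-ok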

tmihoc commented

@roosterfish Aside from doing everything inside the Multipass VM (as I said, with the following specs: multipass launch --name test-vm --cpus 8 --disk 100G --memory 8G), I followed the instructions from the MicroCloud tutorial exactly. That is, I created 4 LXD VMs. Here's the full history:

ubuntu@test-vm:~$ snap version
snap    2.60.4
snapd   2.60.4
series  16
ubuntu  22.04
kernel  5.15.0-87-generic
ubuntu@test-vm:~$ sudo snap install lxd
snap "lxd" is already installed, see 'snap help refresh'
ubuntu@test-vm:~$ lxd init
Would you like to use LXD clustering? (yes/no) [default=no]: 
Do you want to configure a new storage pool? (yes/no) [default=yes]: 
Name of the new storage pool [default=default]: 
Name of the storage backend to use (ceph, cephobject, dir, lvm, zfs, btrfs) [default=zfs]: 
Create a new ZFS pool? (yes/no) [default=yes]: 
Would you like to use an existing empty block device (e.g. a disk or partition)? (yes/no) [default=no]: 
Size in GiB of the new loop device (1GiB minimum) [default=19GiB]: 40GiB
Would you like to connect to a MAAS server? (yes/no) [default=no]: 
Would you like to create a new local network bridge? (yes/no) [default=yes]: 
What should the new bridge be called? [default=lxdbr0]: 
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: 
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: 
Would you like the LXD server to be available over the network? (yes/no) [default=no]: yes
Address to bind LXD to (not including port) [default=all]: 
Port to bind LXD to [default=8443]: 
Trust password for new clients: 
Again: 
Invalid input, try again.

Trust password for new clients: 
Again: 
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]: 
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]: 
ubuntu@test-vm:~$ lxd
Error: This must be run as root
ubuntu@test-vm:~$ sudo lxd
ERROR  [2023-11-14T10:30:16+01:00] Failed to start the daemon                    err="LXD is already running"
ERROR  [2023-11-14T10:30:16+01:00] Unable to run feature checks during QEMU initialization: QEMU premature exit: exit status 1 (qemu: could not load PC BIOS '/usr/share/OVMF/OVMF_CODE.fd') 
WARNING[2023-11-14T10:30:16+01:00] Instance type not operational                 driver=qemu err="QEMU failed to run feature checks" type=virtual-machine
Error: LXD is already running
ubuntu@test-vm:~$ lxc storage create disks zfs size=100GiB
To start your first container, try: lxc launch ubuntu:22.04
Or for a virtual machine: lxc launch ubuntu:22.04 --vm

Storage pool disks created
ubuntu@test-vm:~$ lxc storage set disks volume.size 10GiB
ubuntu@test-vm:~$ lxc storage volume create disks local1 --type block
Storage volume local1 created
ubuntu@test-vm:~$ lxc storage volume create disks local2 --type block
Storage volume local2 created
ubuntu@test-vm:~$ lxc storage volume create disks local3 --type block
Storage volume local3 created
ubuntu@test-vm:~$ lxc storage volume create disks local4 --type block
Storage volume local4 created
ubuntu@test-vm:~$ lxc storage volume create disks remote1 --type block size=20GiB
Storage volume remote1 created
ubuntu@test-vm:~$ lxc storage volume create disks remote2 --type block size=20GiB
Storage volume remote2 created
ubuntu@test-vm:~$ lxc storage volume create disks remote3 --type block size=20GiB
Storage volume remote3 created
ubuntu@test-vm:~$ lxd storage volume list disks
Error: unknown command "storage" for "lxd"
ubuntu@test-vm:~$ lxc storage volume list disks
+--------+---------+-------------+--------------+---------+
|  TYPE  |  NAME   | DESCRIPTION | CONTENT-TYPE | USED BY |
+--------+---------+-------------+--------------+---------+
| custom | local1  |             | block        | 0       |
+--------+---------+-------------+--------------+---------+
| custom | local2  |             | block        | 0       |
+--------+---------+-------------+--------------+---------+
| custom | local3  |             | block        | 0       |
+--------+---------+-------------+--------------+---------+
| custom | local4  |             | block        | 0       |
+--------+---------+-------------+--------------+---------+
| custom | remote1 |             | block        | 0       |
+--------+---------+-------------+--------------+---------+
| custom | remote2 |             | block        | 0       |
+--------+---------+-------------+--------------+---------+
| custom | remote3 |             | block        | 0       |
+--------+---------+-------------+--------------+---------+
ubuntu@test-vm:~$ lxc network create microbr0
Network microbr0 created
ubuntu@test-vm:~$ lxc network get microbr0 ipv4.address
10.39.29.1/24
ubuntu@test-vm:~$ lxc network get microbr0 ipv6.address
fd42:4c13:31e2:3213::1/64
ubuntu@test-vm:~$ lxc init ubuntu:22.04 micro1 --vm --config limits.cpu=2 --config limits.memory=2GiB
Creating micro1
ubuntu@test-vm:~$ lxc init ubuntu:22.04 micro2 --vm --config limits.cpu=2 --config limits.memory=2GiB
Creating micro2
ubuntu@test-vm:~$ lxc init ubuntu:22.04 micro3 --vm --config limits.cpu=2 --config limits.memory=2GiB
Creating micro3
ubuntu@test-vm:~$ lxc init ubuntu:22.04 micro4 --vm --config limits.cpu=2 --config limits.memory=2GiB
Creating micro4
ubuntu@test-vm:~$ lxc storage volume attach disks local1 micro1
ubuntu@test-vm:~$ lxc storage volume attach disks local2 micro2
ubuntu@test-vm:~$ lxc storage volume attach disks local3 micro3
ubuntu@test-vm:~$ lxc storage volume attach disks local4 micro4
ubuntu@test-vm:~$ lxc storage volume attach disks remote1 micro1
ubuntu@test-vm:~$ lxc storage volume attach disks remote2 micro2
ubuntu@test-vm:~$ lxc storage volume attach disks remote3 micro3
ubuntu@test-vm:~$ lxc config device add micro1 eth1 nic network=microbr0 name=eth1
Device eth1 added to micro1
ubuntu@test-vm:~$ lxc config device add micro2 eth1 nic network=microbr0 name=eth1
Device eth1 added to micro2
ubuntu@test-vm:~$ lxc config device add micro3 eth1 nic network=microbr0 name=eth1
Device eth1 added to micro3
ubuntu@test-vm:~$ lxc config device add micro4 eth1 nic network=microbr0 name=eth1
Device eth1 added to micro4
ubuntu@test-vm:~$ lxc start micro1
ubuntu@test-vm:~$ lxc start micro2
ubuntu@test-vm:~$ lxc start micro3
ubuntu@test-vm:~$ lxc start micro4
ubuntu@test-vm:~$ lxc exec micro1 -- bash
root@micro1:~# echo 0 > /proc/sys/net/ipv6/conf/enp6s0/accept_ra
root@micro1:~# ip link set enp6s0 up
root@micro1:~# snap install microceph microovn microcloud --cohort="+"
microceph (quincy/stable) 0+git.7b5672b from Canonical✓ installed
microcloud 1.1-04a1c49 from Canonical✓ installed
microovn (22.03/stable) 22.03.3+snap2d1a04de44 from Canonical✓ installed
root@micro1:~# snap refresh lxd --channel=latest/stable --cohort="+"
2023-11-14T09:39:09Z INFO Waiting for "snap.lxd.daemon.service" to stop.
lxd 5.19-31ff7b6 from Canonical✓ refreshed
root@micro1:~# exit
exit
ubuntu@test-vm:~$ lxc exec micro2 -- bash
root@micro2:~# echo 0 > /proc/sys/net/ipv6/conf/enp6s0/accept_ra
root@micro2:~# ip link set enp6s0 up
root@micro2:~# snap install microceph microovn microcloud --cohort="+"
2023-11-14T09:40:20Z INFO Waiting for conflicting change in progress: conflicting snap microovn with task "setup-profiles"
2023-11-14T09:40:26Z INFO Waiting for conflicting change in progress: conflicting slot snap snapd, task "connect"
microceph (quincy/stable) 0+git.7b5672b from Canonical✓ installed
microcloud 1.1-04a1c49 from Canonical✓ installed
microovn (22.03/stable) 22.03.3+snap2d1a04de44 from Canonical✓ installed
root@micro2:~# snap refresh lxd --channel=latest/stable --cohort="+"
2023-11-14T09:40:49Z INFO Waiting for "snap.lxd.daemon.unix.socket", "snap.lxd.daemon.service" to stop.
lxd 5.19-31ff7b6 from Canonical✓ refreshed
root@micro2:~# exit
exit
ubuntu@test-vm:~$ lxc exec micro3 -- bash
root@micro3:~# echo 0 > /proc/sys/net/ipv6/conf/enp6s0/accept_ra
root@micro3:~# ip link set enp6s0 up
root@micro3:~# snap install microceph microovn microcloud --cohort="+"
microcloud 1.1-04a1c49 from Canonical✓ installed
microovn (22.03/stable) 22.03.3+snap2d1a04de44 from Canonical✓ installed
microceph (quincy/stable) 0+git.7b5672b from Canonical✓ installed
root@micro3:~# snap refresh lxd --channel=latest/stable --cohort="+"
2023-11-14T09:43:21Z INFO Waiting for "snap.lxd.daemon.service" to stop.
lxd 5.19-31ff7b6 from Canonical✓ refreshed
root@micro3:~# exit
exit
ubuntu@test-vm:~$ lxc exec micro4 -- bash
root@micro4:~# echo 0 > /proc/sys/net/ipv6/conf/enp6s0/accept_ra
root@micro4:~# ip link set enp6s0 up
root@micro4:~# snap install microceph microovn microcloud --cohort="+"
microceph (quincy/stable) 0+git.7b5672b from Canonical✓ installed
microcloud 1.1-04a1c49 from Canonical✓ installed
microovn (22.03/stable) 22.03.3+snap2d1a04de44 from Canonical✓ installed
root@micro4:~# snap refresh lxd --channel=latest/stable --cohort="+"
2023-11-14T09:45:16Z INFO Waiting for "snap.lxd.daemon.service" to stop.
lxd 5.19-31ff7b6 from Canonical✓ refreshed
root@micro4:~# exit
exit
ubuntu@test-vm:~$ lxc exec micro1 -- bash
root@micro1:~# microcloud init
Waiting for LXD to start...
Select an address for MicroCloud's internal traffic:

 Using address "10.143.231.246" for MicroCloud

Limit search for other MicroCloud servers to 10.143.231.246/24? (yes/no) [default=yes]: 
Scanning for eligible servers ...

 Selected "micro1" at "10.143.231.246"
 Selected "micro2" at "10.143.231.111"
 Selected "micro3" at "10.143.231.118"
 Selected "micro4" at "10.143.231.196"

Would you like to set up local storage? (yes/no) [default=yes]: 
Select exactly one disk from each cluster member:

Select which disks to wipe:

 Using "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_local1" on "micro1" for local storage pool
 Using "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_local2" on "micro2" for local storage pool
 Using "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_local3" on "micro3" for local storage pool
 Using "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_lxd_local4" on "micro4" for local storage pool

Would you like to set up distributed storage? (yes/no) [default=yes]: 
Unable to find disks on some systems. Continue anyway? (yes/no) [default=yes]: 
Select from the available unpartitioned disks:

Select which disks to wipe:

 Using 1 disk(s) on "micro1" for remote storage pool
 Using 1 disk(s) on "micro2" for remote storage pool
 Using 1 disk(s) on "micro3" for remote storage pool

Configure distributed networking? (yes/no) [default=yes]: 
Select exactly one network interface from each cluster member:

 Using "enp6s0" on "micro2" for OVN uplink
 Using "enp6s0" on "micro1" for OVN uplink
 Using "enp6s0" on "micro3" for OVN uplink
 Using "enp6s0" on "micro4" for OVN uplink

Specify the IPv4 gateway (CIDR) on the uplink network (empty to skip IPv4): 10.39.29.1/24
Specify the first IPv4 address in the range to use with LXD: 10.39.29.100 
Specify the last IPv4 address in the range to use with LXD: 10.39.29.254
Specify the IPv6 gateway (CIDR) on the uplink network (empty to skip IPv6): fd42:4c13:31e2:3213::1/64
Initializing a new cluster
 Local MicroCloud is ready
 Local LXD is ready
 Local MicroOVN is ready
 Local MicroCeph is ready
Awaiting cluster formation ...
 Peer "micro4" has joined the cluster
Error: write vsock vm(4294967295):928159682->vm(4):8443: broken pipe
ubuntu@test-vm:~$

roosterfish commented

@tmihoc ah ok, I wasn't aware that the hardware recommendations don't match what is written in the tutorial.

@ru-fu should we align them?

@tmihoc I was able to reproduce the issue. The original assumption was correct. One of the nested VMs gets killed because the Multipass VM runs out of memory:

Specify the last IPv4 address in the range to use with LXD: 10.80.81.150 
Specify the IPv6 gateway (CIDR) on the uplink network (empty to skip IPv6): fd42:c1a5:afae:132e::1/64
Initializing a new cluster
 Local MicroCloud is ready
 Local LXD is ready
 Local MicroOVN is ready
 Local MicroCeph is ready
Awaiting cluster formation ...
 Peer "micro4" has joined the cluster
Error: read vsock vm(4294967295):165007288->vm(4):8443: connection reset by peer

You can check using dmesg on the Multipass VM:

root@test-vm:~# dmesg
...
[ 1759.686827] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=qemu-system-x86,pid=29855,uid=999
[ 1759.686853] Out of memory: Killed process 29855 (qemu-system-x86) total-vm:3084808kB, anon-rss:48208kB, file-rss:0kB, shmem-rss:2088624kB, UID:999 pgtables:4980kB oom_score_adj:0
[ 1761.744396] oom_reaper: reaped process 29855 (qemu-system-x86), now anon-rss:0kB, file-rss:0kB, shmem-rss:2088932kB

Can you try running the test outside of the Multipass VM? I would suspect it's ultimately due to the nesting, but you can also try with more memory on the test-vm and the nested micro(1|2|3|4) VMs.
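
A simple way to watch this happen live while microcloud init runs (illustrative commands using standard Ubuntu tooling, not part of the tutorial):

# In a second shell on the Multipass VM:
sudo dmesg --follow | grep -iE 'oom|out of memory'   # logs the moment the OOM killer fires
watch -n 2 free -h                                   # live memory headroom while the nested VMs start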

tmihoc commented

@roosterfish

> I would suspect it's ultimately due to the nesting

Does that mean it is not possible to try it out inside a Multipass VM?

> you can also try with more memory on the test-vm and the nested micro(1|2|3|4) VMs.

How much would you recommend?

roosterfish commented

The tutorial doesn't expect you to run it within a Multipass VM. I would not recommend it, because you would have three levels of nesting as soon as you started another instance on the MicroCloud inside the Multipass VM.

I haven't tested with Multipass VMs in the past, but you could try 10GB for the Multipass VM. That would leave 2GB for the VM's own OS. In your example, all 8GB get assigned to the micro* VMs, which might be the root cause here.
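
Concretely, something like this (not tested in this nested setup; the numbers only illustrate the headroom argument):

# 4 nested VMs x 2GiB = 8GiB, i.e. the outer VM's entire allocation.
# Launching the outer VM with 10G leaves ~2GiB for its own OS and QEMU overhead:
multipass launch --name test-vm --cpus 8 --disk 100G --memory 10G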

tmihoc commented

> The tutorial doesn't expect you to run it within a Multipass VM.

Fair point.

> I would not recommend it, because you would have three levels of nesting as soon as you started another instance on the MicroCloud inside the Multipass VM.

This is no different from using a Multipass VM to test Juju with the LXD or MicroK8s cloud, though, is it? And we do that all the time.

> I haven't tested with Multipass VMs in the past, but you could try 10GB for the Multipass VM. That would leave 2GB for the VM's own OS. In your example, all 8GB get assigned to the micro* VMs, which might be the root cause here.

Thanks a lot, I'll try that!

tmihoc commented

> I would not recommend it, because you would have three levels of nesting as soon as you started another instance on the MicroCloud inside the Multipass VM.

> This is no different from using a Multipass VM to test Juju with the LXD or MicroK8s cloud, though, is it? And we do that all the time.

I take that back -- there we just had containers inside a VM, but this time it's VMs within a VM, so yeah, not the same.
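
(Side note: the difference is visible in LXD itself; containers share the host kernel, so only VM-type instances add a virtualization level. The TYPE column of lxc list shows which is which:)

# TYPE reads CONTAINER or VIRTUAL-MACHINE per instance
lxc list -c nts4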

ru-fu commented

> @tmihoc ah ok, I wasn't aware that the hardware recommendations don't match what is written in the tutorial.

> @ru-fu should we align them?

We want the tutorial to be easy to run on a normal workstation. Requiring 8 GiB of RAM for each of the 4 VMs would make that impossible for a lot of users - and we're only launching a handful of instances as part of the tutorial, so we can do with less RAM. I can add a note, though, explaining that the tutorial uses less than the recommended RAM.
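
(For readers whose workstation does have the recommended RAM, the tutorial's per-VM limit can be raised before the VMs are started; the 8GiB value below is illustrative, not a tested recommendation.)

# Raise the memory limit of each nested VM while it is stopped
for vm in micro1 micro2 micro3 micro4; do
    lxc config set "$vm" limits.memory 8GiB
done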

roosterfish commented

Sounds good to me, thanks!

roosterfish commented

@tmihoc did you manage to deploy the MicroCloud without errors?

tmihoc commented

@roosterfish @ru-fu I haven't had a chance to try again; I'll let you know when I do. In the meantime, having thought about it some more, I now understand your concerns: a Multipass VM adds too much virtualization, the tutorial is probably correct as is, and I should try again directly on my workstation. I think we can close this issue.