canonical/microcloud

MicroCloud init fails

Closed this issue · 5 comments

I have been experimenting with a MicroCloud setup for a while now.
I have three arm64 boards, all running Ubuntu Core 22 images:
two Xilinx KV260 boards and a MediaTek board.

The setup usually works fine, but quite often it fails during microcloud init with the following error:

Error: System "cobuntu" failed to join the cluster: Failed to update cluster status of services: Failed to join "LXD" cluster: LXD has already been initialized

On the affected node (in this case cobuntu), the MicroCloud daemon logs show the following error:
Sep 30 12:48:50 cobuntu microcloud.daemon[1811]: time="2023-09-30T12:48:50Z" level=error msg="Failed to start server" err="accept tcp [::]:9443: use of closed network connection"
@tomponline explained during the product roadmap sprint that the reason could be that two different SoCs are being used as cluster nodes, and that the Xilinx boards are short on memory and processing power.
I am raising it here because I never observed CPU or memory usage exceeding the boards' limits.

```
lxd 5.19-31ff7b6 26096 latest/stable canonical** -
microceph 0+git.7b5672b 710 quincy/stable canonical** -
microcloud 1.1-04a1c49 737 latest/stable canonical** -
microovn 22.03.3+snap2d1a04de44 301 22.03/stable canonical** -
```

Hi @prash813, do I have the right understanding that you are redeploying MicroCloud on the same arm64 boards that you have described above and sometimes this process fails with the given error message?

You cannot run microcloud init twice on the same system. Normally this produces the error Error: LXD has already been initialized.

I was just checking a fresh MicroCloud installation and I can also see the following log message when running snap logs microcloud:

2023-11-23T14:22:12Z microcloud.daemon[2479]: time="2023-11-23T14:22:12Z" level=error msg="Failed to start server" err="accept tcp [::]:9443: use of closed network connection"

@masnax have you seen this message before? Maybe we should investigate this separately.

This isn't actually a problem. MicroCloud starts up a basic listener when the daemons start, and this gets torn down when the system is bootstrapped. That error is from the listener closing before the context is canceled. We can clean up the log but it's not actually affecting anything.
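For anyone curious where that message comes from, here is a minimal Go sketch of the general pattern, not MicroCloud's actual code. It assumes only the standard library: calling Accept on a listener that has already been closed returns an error wrapping net.ErrClosed, whose text is "use of closed network connection". The helper names closedListenerError and isExpectedShutdown are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// closedListenerError opens a TCP listener on a free local port, closes it,
// and then calls Accept. The first return value is the error from Accept;
// the second is any error from setting up or closing the listener.
func closedListenerError() (acceptErr error, setupErr error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	if err := l.Close(); err != nil {
		return nil, err
	}
	_, acceptErr = l.Accept()
	return acceptErr, nil
}

// isExpectedShutdown reports whether err only says that the listener was
// closed, which is harmless when the close was intentional.
func isExpectedShutdown(err error) bool {
	return errors.Is(err, net.ErrClosed)
}

func main() {
	acceptErr, setupErr := closedListenerError()
	if setupErr != nil {
		fmt.Println("setup failed:", setupErr)
		return
	}
	fmt.Println("accept error:", acceptErr)
	fmt.Println("expected shutdown:", isExpectedShutdown(acceptErr))
}
```

A server loop that checks errors.Is(err, net.ErrClosed) before logging can skip this message when the listener was torn down on purpose, which is one way the log could be cleaned up.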

@roosterfish, the error in comment #1 doesn't only occur when redeploying MicroCloud.
I have seen it in the following two scenarios:

  1. On boards with a fresh installation of Ubuntu Core, when deploying MicroCloud for the first time.
  2. With MicroCloud already set up on the boards, after purging all four snaps, rebooting the boards, and then redeploying MicroCloud.

I would like to know the reason for the first part of the error:
Error: System "cobuntu" failed to join the cluster: Failed to update cluster status of services:

It would be nice to understand what exactly is going wrong with that board. Its console works fine, and commands like vmstat, top, or free show CPU and memory usage that looks OK to me, never more than 75%.

This is happening because MicroCloud is sending a notification to LXD on cobuntu telling it that it should ask to join the LXD cluster. However, cobuntu is reporting back that LXD has already been configured there, so it can't join another cluster.

The way MicroCloud is determining that LXD has been configured is to check if LXD has configured any storage pools on that system.

You can check this with lxc storage list. If that reports anything, you'll have to clean it up on each system.
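If you want to script that check across several boards, a rough Go sketch could look like the following. It assumes you capture the output of lxc storage list --format json on each system and that each entry carries a name field. The type storagePool, the function poolNames, and the sample data are hypothetical and only illustrate the idea; this is not how MicroCloud performs the check internally.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// storagePool holds the only field this sketch reads from each entry of the
// JSON list. Other fields in the real output are ignored.
type storagePool struct {
	Name string `json:"name"`
}

// poolNames parses a JSON array of storage pools and returns their names.
// An empty result means no pools were reported for that system.
func poolNames(listJSON []byte) ([]string, error) {
	var pools []storagePool
	if err := json.Unmarshal(listJSON, &pools); err != nil {
		return nil, fmt.Errorf("parse storage pool list: %w", err)
	}
	names := make([]string, 0, len(pools))
	for _, p := range pools {
		names = append(names, p.Name)
	}
	return names, nil
}

func main() {
	// Hypothetical sample standing in for the captured command output.
	sample := []byte(`[{"name": "local"}]`)

	names, err := poolNames(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if len(names) > 0 {
		fmt.Println("storage pools present, clean up before init:", names)
	} else {
		fmt.Println("no storage pools found")
	}
}
```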

Could we improve the error to say that the system already has storage pools, perhaps?