charmed-hpc/slurm-bundles

Deploy the latest bundle locally: mysql relationship problem with bundle.yaml file

Closed this issue · 43 comments

Hello,
I have resumed my tests with the latest versions, in particular the replacement of Percona by MySQL.
There is a problem with the relation between "mysql:mysql" and "slurmdbd:db" as written in the bundle.yaml file.
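For reference, the relation in question in bundle.yaml presumably looks something like this (a sketch inferred from the endpoint names above, not the exact file contents):

relations:
- ["slurmdbd:db", "mysql:mysql"]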
[Screenshot 2023-09-09 15-21-21]

[Screenshot 2023-09-09 15-22-10]

Thanks.
Moula.

Hello @moula!

Thank you for trying the bundles.

They were relying on the legacy charms.

I am updating the bundles to use the current charms.

I will let you know as soon as they are updated.

Hello @jaimesouza. Thank you very much. Good job.

Hi @moula! Try again, please! Thanks!

Hello @jaimesouza.
I just tried again; the deployment doesn't work. There is a problem with MySQL and the creation of the InnoDB instance...
[Screenshot 2023-09-13 11-54-19]

If I try without "--overlay ./slurm-core/charms/latest-edge.yaml":
[Screenshot 2023-09-13 12-01-38]

Hello @moula! What's in the MySQL logs?

juju debug-log --include mysql/0 --replay

The mysql-server IP is 127.0.0.1.
Why not 172.200.0.126?
A problem with MAAS in /etc/hosts?

@moula, please log into the mysql node and double-check that:

$ host mysql.tychecloud.org

Apparently it is resolving to 127.0.0.1

It might be misconfigured on MAAS DNS.

You can also try manually setting the entry in /etc/hosts to test it.
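As a quick sketch (assuming the node address 172.200.0.126 mentioned above), the temporary /etc/hosts override would be a single line:

172.200.0.126 mysql.tychecloud.org mysql

It can be verified with getent hosts mysql.tychecloud.org, which consults /etc/hosts, whereas the host command queries DNS directly.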

@jaimesouza Modifying the /etc/hosts file manually does not solve the problem: after rebooting the mysql-server it always reverts to 127.0.0.1 and does not keep 172.200.0.126.
In MAAS/DNS I can see that it is saved!!!
[Screenshot 2023-09-13 16-22-29]

I'll look more closely at a possible problem on MAAS and get back to you.
Thank you.

Hello @jaimesouza
I looked into how to solve the MySQL IP problem. It comes from the fact that when MAAS deploys the bundle, the Ubuntu installation only assigns 127.0.0.1 as the IP and does not take into account the fixed IPs of mysql, 192.168.70.126 or 172.200.0.126 in my configuration.
[Screenshot 2023-09-15 19-48-50]

At this point, the command host mysql.tychecloud.org gives: mysql.tychecloud.org has address 127.0.0.1.
Meanwhile, the MAAS server plays its role as the local DNS server: $ resolvectl status
[Screenshots 2023-09-15 19-44-28 and 2023-09-15 19-43-39]
Now the command host mysql.tychecloud.org gives the 3 IPs...
[Screenshot 2023-09-15 19-41-19]
Unfortunately, even if I retry the deployment of the bundle, the deployment script does not restart mysql on the 192... IP but still on 127.0.0.1, and it remains blocked!!!

[Screenshot 2023-09-15 19-55-28]

Thanks.
Moula.

Hello @jaimesouza
I tried another manual solution: changing the MySQL bind address from 127.0.0.1 to the host IP in the mysqld.cnf file. It didn't work!!!
I have the impression that once the mysql charm has started on 127.0.0.1, it does not try to restart on another IP despite the manual modifications and a reboot!!!
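For reference, the manual change described above would look something like this (a sketch assuming the stock mysqld.cnf location; with the charmed-mysql snap used by the current charm the config path and restart command differ):

# /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
bind-address = 172.200.0.126

# then restart the service
sudo systemctl restart mysql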
Have a nice weekend.
Moula.

Hello @moula! Have you tried anything else after that?

@derekcat do you have any idea on how to solve this?

Hello @jaimesouza
I tried again this morning with the new versions of MAAS and Juju (lots of new features), but the problem is still the same.
1- MAAS with its DNS module does not give the fixed IP address to the Ubuntu server during deployment (only 127.0.0.1).
2- Giving the fixed IP to the deployed Ubuntu server and rebooting does not change anything.
3- The mysql charm remains frozen on 127.0.0.1.
So automatic deployment of the bundle does not work.
Thanks.
Moula.

Ok @moula! I am gonna include more people in this discussion.

Would you like to join this team https://ubuntu.com/community/governance/teams/hpc?

We discuss lots of topics related to HPC and Ubuntu there. You can communicate with people on matrix and ask questions. There are knowledgeable guys who could help you with this issue or find someone else who could.

Let's move this discussion there: https://matrix.to/#/#ubuntu-hpc:matrix.org

What do you think?

@moula can you please re-try with MySQL from channel 8.0/candidate (revision 193 at the moment)? Thanks!
The upcoming revision has several fixes which may help you, e.g. canonical/mysql-operator#237
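For reference, switching an already-deployed application to that channel would look roughly like this (a sketch; use the command matching your Juju major version):

# Juju 2.9.x
juju upgrade-charm mysql --channel 8.0/candidate
# Juju 3.x
juju refresh mysql --channel 8.0/candidate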

Just noticed you are trying 8.0/edge (revision 196)...

As replied here, the charm is getting the IP from Juju in this part of the code.
Can you please check and share the output of juju show-unit mysql/0? I suspect private-address is still localhost there.
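A quick way to check just that field (sketch):

juju show-unit mysql/0 | grep -i private-address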

@taurus-forever
This is the configuration of the mysql node
[Screenshot 2023-09-18 17-41-01]

Hmmmm, does the node still get the wrong DNS entries if it's released from the model and redeployed elsewhere?
@jaimesouza would know better whether the charm picks up on changes, but I suspect it won't, and you'd have to redeploy that part of the model (reapply the bundle) after verifying in a testing model that the node gets the correct DNS entries.

Hi @derekcat @jaimesouza @moula I was reviewing your comments.
If you check the link shared by @taurus-forever, I believe this is the part of the code affecting you.

One thing I noticed from your screenshots is that bind-address is not defined in your show-unit outputs.

Do you mind running this command:

# Juju 2.9.x
juju run --unit mysql/0 -- network-get database-peers
# Juju 3.1.x
juju exec --unit mysql/0 -- network-get database-peers

Hello @phvalguima
I use Juju 2.9.
[Screenshot 2023-09-19 15-51-38]

Thank you.
Moula.

Hello @jaimesouza As you know, it's a bug with MySQL and the IP...
I'm waiting for this to be fixed and tested by the Canonical QA team.
I would like to thank you and the whole team for your responsiveness and answers.
I'm going to leave you in peace for a little while to get back to the Kubeflow + MLflow + COS projects.
Thank you very much.
Moula.

Hi @moula! No problem!
Glad to help!
If you need any assistance, just let us know.

Hey @moula!

Have the issues been resolved? If so, how did you do it?

I am having the same issue on MySQL charm on vSphere.

Good evening @jaimesouza
Sorry, but the issue is still there.
[Screenshots 2023-10-20 19-47-17, 2023-10-20 19-46-33, and 2023-10-20 19-43-08]

Hey @moula!

@NucciTheBoss from Canonical is also investigating this issue.

I will let you know when we have a solution for it.

Thank you!

Hi there @moula - are you sure that you have DHCP configured correctly in your MAAS cluster? The reason I ask is that, looking through the screenshots you have shared, it seems your nodes are having trouble connecting to the network. For example, I notice in your attached screenshots that your nodes are failing to pull the SLURM packages from Omnivector's PPA on Launchpad, so it's likely they are unable to reach the external network. What's likely happening with MySQL is that it is defaulting to /etc/hosts because it is either unable to contact your DNS server, or the node has not been assigned an IP by your DHCP server.

Could you both (including @jaimesouza) make sure that your DHCP and DNS servers are configured correctly and that your deployed nodes can reach the external network? E.g. does ping ubuntu.com successfully transmit and receive ICMP packets? I'm not able to reproduce these issues on LXD, OpenStack, AWS, or GCP, which usually include their own DNS and DHCP services.

See section 10.4 of the Debian networking reference for an explanation of why MySQL resolves to 127.0.1.1 when your instance does not have an assigned IP address.
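Concretely, that scenario shows up as an /etc/hosts along these lines (a sketch of the pattern that reference describes, not taken from the affected node):

127.0.0.1   localhost
127.0.1.1   mysql.tychecloud.org mysql

i.e. the hostname is mapped to the 127.0.1.1 loopback entry when the installer cannot tie it to a permanent address, so lookups through /etc/hosts return a loopback address instead of the DHCP or static one.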

> What's likely happening with MySQL is that it is defaulting to /etc/hosts because it is either unable to contact your DNS server, or the node has not been assigned an IP by your DHCP server.

Looking at the error reported from the slurmd/0 unit (shown in the first 2 screen captures provided in this comment), I am curious whether there's a network proxy in the way. Can you comment on whether this is the case or not? If so, have you set any model configuration options in Juju for the proxies? Additionally, I will note that the add-apt-repository command, when resolving a PPA URL in the short format, makes queries out to the Launchpad APIs. If these are restricted by local network settings, those queries will fail and the target URLs for adding the apt-source strings cannot be resolved. This is something that can, and should be, handled by the slurm charms here.
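For reference, the proxy-related model options can be inspected and set roughly like this (a sketch; the proxy URL is a placeholder):

juju model-config | grep proxy
juju model-config juju-http-proxy=http://proxy.example:3128 juju-https-proxy=http://proxy.example:3128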

Additionally, the error reported by the mysql-router/0 unit indicates that the charmed-mysql snap is already installed. The installation hook is fairly simple; its primary purpose is to install the snap itself. However, perusing the code here, I can see a scenario where the asynchronous task of installing the snap fails to complete before the charm logic times out the installation, ultimately causing the hook to error out. Upon re-running the installation hook, the mysql-router charm uses a simple check to see if the snap is already installed, but it doesn't account for the snap already being installed when re-executing after a failed install hook. This may or may not be by design, but I'll raise a bug against the mysql-router charm to handle this scenario and let them weigh in on it (cc @taurus-forever).

The stack traces provided don't capture the mysql-operator logs themselves. However, the juju status output in the 3rd screenshot clearly indicates that the mysql service could not be configured, which matches the original condition. Ultimately, I think there are a few things that need to be cleared up here, but the data provided doesn't help identify what the problem is in the mysql-operator case.

@NucciTheBoss Hi Jason. Thank you for your message and your work. I just want to tell you that on this platform I have 15 bundles that have always worked: the Sunbeam bundle, the OpenStack bundle, MicroK8s, MicroCeph, Charmed-k8s, Kubeflow, MLflow, COS, PostgreSQL, MySQL... Even the old HPC bundle on focal worked, so you understand what I mean. This is the only bundle that no longer deploys since Percona was replaced with MySQL 8: one bundle out of 15 that is no longer deployable on my platform of 25 Asus physical servers!!! Even the mysql bundle doesn't deploy on its own; I reported the bug to the mysql bundle team some time ago, though to tell you the truth I managed to deploy it charm by charm. Another thing: I have been testing since yesterday on juju-controller 2.9 and juju-controller 3.1, but the problem is the same. I am still testing right now.
[Screenshots 2023-10-25 01-26-53, 01-24-55, 01-24-37, 01-24-10, and 01-23-42]

@NucciTheBoss The IP of the mysql node is fixed at the MAAS level: 192.168.70.130 and 172.200.0.130.
[Screenshot 2023-10-25 01-47-10]

I restarted the deployment live, but the issue is still the same:
[Screenshots 2023-10-25 01-51-37 and 2023-10-25 01-51-10]

> Looking at the error reported from the slurmd/0 unit ... I am curious whether there's a network proxy in the way. Can you comment on whether this is the case or not?

Hi @wolsen. No, it was just an oversight in this image. Since then I have tried again several times, including the current live run.

@moula thank you for refreshing your deployment; that clears up some of my initial confusion. Can you ssh into each of the nodes to ensure that they can access the external network?

For example, could you try contacting ubuntu.com from mysql/0:

juju ssh mysql/0
ping ubuntu.com

Can you then verify that mysql/0 can reach slurmdbd/0?

ping <ip_for_slurmdbd/0>

I just want to ensure that your nodes can reach the network. This will help narrow down where the issue is located within the cluster.

@NucciTheBoss I have another error message after rebooting the nodes, as requested during the installation of the jammy OS.

[Screenshot 2023-10-25 02-26-25]

@NucciTheBoss I have a question:
After modifying the IPs of the nodes in /etc/cloud/templates/hosts.debian.tmpl and restarting the nodes, are there any commands to force the bundle deployment to re-execute?
moula@maas:~/slurm-bundles$ juju deploy ./slurm-core/bundle.yaml --overlay ./slurm-core/charms/latest-edge.yaml --force
No changes to apply.

@moula

The issue is on /etc/hosts indeed. For some reason, my vSphere VM comes with the following line:

127.0.1.1 juju-39c062-25 juju-39c062-25

You have the same thing.

I have tested pre-adding the machine, commenting that line, rebooting the machine, and deploying the charm on it after that.

It deployed successfully!

I need to make the changes in /etc/cloud/templates/hosts.debian.tmpl so that they persist after a reboot.
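For reference, the non-persistent part of the workaround is just commenting out the loopback line (a sketch using the hostname from the vSphere VM above):

# /etc/hosts
#127.0.1.1 juju-39c062-25 juju-39c062-25

The same line also has to be removed from /etc/cloud/templates/hosts.debian.tmpl, the cloud-init template that regenerates /etc/hosts on boot when manage_etc_hosts is enabled, for the change to survive a reboot.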

Thanks @jaimesouza for testing this out. This can be a stop-gap patch in the short term; however, we should identify what the root cause of the issue is.

We're looking at what the issue could possibly be within the MySQL operator. It's obvious now that the nodes in both your clouds are being assigned IPs via DHCP and able to contact the network, so we need to figure out why the MySQL operator is resolving the IP to be 127.0.1.1. Don't want to patch /etc/hosts if we don't have to 😅

> After modifying the IPs of the nodes in /etc/cloud/templates/hosts.debian.tmpl and restarting the nodes, are there any commands to force the bundle deployment to re-execute?

I don't think so. Generally you shouldn't need to re-execute the bundle deployment, as that goes against the autonomous nature of Juju. Juju will automatically attempt to rerun failed hooks.

If you have failed hooks in your cluster, you can follow the documentation here for how to retrigger the hook: https://juju.is/docs/sdk/debug-a-charm#heading--debug-a-single-failing-hook. This documentation is aimed at debugging, but you can use it to rerun hooks should you need to!
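For a failed hook, the retry itself looks something like this (a sketch; mysql/0 used as the example unit):

# re-run the failed hook on the unit
juju resolved mysql/0
# or mark it resolved without retrying the hook
juju resolved --no-retry mysql/0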

> I have tested pre-adding the machine, commenting that line, rebooting the machine, and deploying the charm on it after that. It deployed successfully!

@jaimesouza Thanks.
I did the same thing with the mysql node, adding the fixed IP to it once the Ubuntu OS was installed and rebooted, but for the moment it doesn't work, hence my last question to @NucciTheBoss. I hope that with your modifications it will work. They are also making changes to the mysql-router charm.
Thank you to all of you.

Hi guys. I think our bug will be fixed soon. I will test it as soon as it is applied. Thanks.
canonical/mysql-router-operator#84.
Moula.