charmed-hpc/slurm-bundles

Deploy the latest bundle locally: mysql relationship problem with bundle.yaml file

Closed this issue · 43 comments

Hello,
I have resumed my tests with the latest versions, in particular the replacement of Percona by MySQL.
There is a problem with the relation between "mysql:mysql" and "slurmdbd:db" as written in the bundle.yaml file.
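For reference, the relation in question in bundle.yaml presumably looks something like this (a sketch inferred from the endpoint names above, not the exact file contents):

relations:
- ["slurmdbd:db", "mysql:mysql"]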
[Screenshot 2023-09-09 15-21-21]

[Screenshot 2023-09-09 15-22-10]

Thanks.
Moula.

Hello @moula!

Thank you for trying the bundles.

They were relying on the legacy charms.

I am updating the bundles to use the current charms.

I will let you know as soon as they are updated.

Hello @jaimesouza. Thank you very much. Good job.

Hi @moula! Try again, please! Thanks!

Hello @jaimesouza.
I just tried again; the deployment doesn't work. There is a problem with MySQL and the creation of the InnoDB instance...
[Screenshot 2023-09-13 11-54-19]

If I try without "--overlay ./slurm-core/charms/latest-edge.yaml":
[Screenshot 2023-09-13 12-01-38]

Hello @moula! What's in the MySQL logs?

juju debug-log --include mysql/0 --replay

The mysql-server IP is 127.0.0.1.
Why not 172.200.0.126?
A problem with MAAS in /etc/hosts?

@moula, please log into the mysql node and double-check that:

$ host mysql.tychecloud.org

Apparently it is resolving to 127.0.0.1

It might be misconfigured on MAAS DNS.

You can also try manually setting the entry in /etc/hosts to test it.
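As a quick sketch (assuming the node address 172.200.0.126 mentioned above), the temporary /etc/hosts override would be a single line:

172.200.0.126 mysql.tychecloud.org mysql

It can be verified with getent hosts mysql.tychecloud.org, which consults /etc/hosts, whereas the host command queries DNS directly.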

@jaimesouza Modifying the /etc/hosts file manually does not solve the problem: after rebooting the mysql-server it always reverts to 127.0.0.1 and does not keep 172.200.0.126.
In MAAS/DNS I can see that it is saved!!!
[Screenshot 2023-09-13 16-22-29]

I'll look more closely at a possible problem on MAAS and get back to you.
Thank you.

Hello @jaimesouza
I looked into how to solve the MySQL IP problem. It comes from the fact that when MAAS deploys the bundle, the Ubuntu installation only assigns 127.0.0.1 as the IP and does not take into account the fixed IPs of mysql, 192.168.70.126 or 172.200.0.126 in my configuration.
[Screenshot 2023-09-15 19-48-50]

At this point, the command host mysql.tychecloud.org gives: mysql.tychecloud.org has address 127.0.0.1.
Meanwhile, the MAAS server plays its role as the local DNS server: $ resolvectl status
[Screenshots 2023-09-15 19-44-28 and 2023-09-15 19-43-39]
Now the command host mysql.tychecloud.org gives the 3 IPs...
[Screenshot 2023-09-15 19-41-19]
Unfortunately, even if I retry the deployment of the bundle, the deployment script does not restart mysql on the 192... IP but still on 127.0.0.1, and it remains blocked!!!

[Screenshot 2023-09-15 19-55-28]

Thanks.
Moula.

Hello @jaimesouza
I tried another manual solution: changing the MySQL bind address from 127.0.0.1 to the host IP in the mysqld.cnf file. It didn't work!!!
I have the impression that once the mysql charm has started on 127.0.0.1, it does not try to restart on another IP despite the manual modifications and a reboot!!!
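For reference, the manual change described above would look something like this (a sketch assuming the stock mysqld.cnf location; with the charmed-mysql snap used by the current charm the config path and restart command differ):

# /etc/mysql/mysql.conf.d/mysqld.cnf
[mysqld]
bind-address = 172.200.0.126

# then restart the service
sudo systemctl restart mysql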
Have a nice weekend.
Moula.

Hello @moula! Have you tried anything else after that?

@derekcat do you have any idea on how to solve this?

Hello @jaimesouza
I tried again this morning with the new versions of MAAS and Juju (lots of new features), but the problem is still the same.
1- MAAS with its DNS module does not give the fixed IP address to the Ubuntu server during deployment (only 127.0.0.1).
2- Giving the fixed IP to the deployed Ubuntu server and rebooting does not change anything.
3- The mysql charm remains frozen on 127.0.0.1.
So automatic deployment of the bundle does not work.
Thanks.
Moula.

Ok @moula! I am gonna include more people in this discussion.

Would you like to join this team https://ubuntu.com/community/governance/teams/hpc?

We discuss lots of topics related to HPC and Ubuntu there. You can communicate with people on matrix and ask questions. There are knowledgeable guys who could help you with this issue or find someone else who could.

Let's move this discussion there: https://matrix.to/#/#ubuntu-hpc:matrix.org

What do you think?

@moula can you please re-try with MySQL from channel 8.0/candidate (revision 193 at the moment)? Thanks!
The upcoming revision has several fixes which may help you, e.g. canonical/mysql-operator#237
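For reference, switching an already-deployed application to that channel would look roughly like this (a sketch; use the command matching your Juju major version):

# Juju 2.9.x
juju upgrade-charm mysql --channel 8.0/candidate
# Juju 3.x
juju refresh mysql --channel 8.0/candidate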

Just noticed you are trying 8.0/edge (revision 196)...

As replied here, the charm is getting the IP from Juju in this part of the code.
Can you please check and share the output of juju show-unit mysql/0? I suspect private-address is still localhost there.
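A quick way to check just that field (sketch):

juju show-unit mysql/0 | grep -i private-address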

@taurus-forever
This is the configuration of the mysql node
[Screenshot 2023-09-18 17-41-01]

Hmmmm, does the node still get the wrong DNS entries if it's released from the model and redeployed elsewhere?
@jaimesouza would know better whether the charm picks up on changes, but I suspect it won't, and you'd have to redeploy that part of the model (reapply the bundle) after verifying in a testing model that the node gets the correct DNS entries.

Hi @derekcat @jaimesouza @moula I was reviewing your comments.
If you check the link shared by @taurus-forever, I believe this is the part of the code affecting you.

One thing I noticed from your screenshots is that bind-address is not defined in your show-unit outputs.

Do you mind running this command:

# Juju 2.9.x
juju run --unit mysql/0 -- network-get database-peers
# Juju 3.1.x
juju exec --unit mysql/0 -- network-get database-peers

Hello @phvalguima
I use Juju 2.9.
[Screenshot 2023-09-19 15-51-38]

Thank you.
Moula.

Hello @jaimesouza As you know, it's a bug with MySQL and the IP...
I'm waiting for this to be fixed and tested by the Canonical QA team.
I would like to thank you and the whole team for your responsiveness and answers.
I'm going to leave you in peace for a little while to get back to the Kubeflow + MLflow + COS projects.
Thank you very much.
Moula.

Hi @moula! No problem!
Glad to help!
If you need any assistance, just let us know.

Hey @moula!

Have the issues been resolved? If so, how did you do it?

I am having the same issue on MySQL charm on vSphere.

Good evening @jaimesouza
Sorry, but the issue is still there.
[Screenshots 2023-10-20 19-47-17, 2023-10-20 19-46-33, and 2023-10-20 19-43-08]

Hey @moula!

@NucciTheBoss from Canonical is also investigating this issue.

I will let you know when we have a solution for it.

Thank you!

Hi there @moula - are you sure that you have DHCP configured correctly in your MAAS cluster? The reason I ask is that, looking through the screenshots you have shared, it seems your nodes are having trouble connecting to the network. For example, I notice in your attached screenshots that your nodes are failing to pull the SLURM packages from Omnivector's PPA on Launchpad, so it's likely they are unable to reach the external network. What's likely happening with MySQL is that it is defaulting to /etc/hosts because it is either unable to contact your DNS server, or the node has not been assigned an IP by your DHCP server.

Could you both (including @jaimesouza) make sure that your DHCP and DNS servers are configured correctly and that your deployed nodes can reach the external network? E.g. does ping ubuntu.com successfully transmit and receive ICMP packets? I'm not able to reproduce these issues on LXD, OpenStack, AWS, or GCP, which usually include their own DNS and DHCP services.

See section 10.4 of the Debian networking reference for an explanation of why MySQL resolves to 127.0.1.1 when your instance does not have an assigned IP address.
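Concretely, that scenario shows up as an /etc/hosts along these lines (a sketch of the pattern that reference describes, not taken from the affected node):

127.0.0.1   localhost
127.0.1.1   mysql.tychecloud.org mysql

i.e. the hostname is mapped to the 127.0.1.1 loopback entry when the installer cannot tie it to a permanent address, so lookups through /etc/hosts return a loopback address instead of the DHCP or static one.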

> What's likely happening with MySQL is that it is defaulting to /etc/hosts because it is either unable to contact your DNS server, or the node has not been assigned an IP by your DHCP server.

Looking at the error reported from the slurmd/0 unit (shown in the first 2 screen captures provided in this comment), I am curious whether there's a network proxy in the way. Can you comment on whether this is the case or not? If so, have you set any model configuration options in Juju for the proxies? Additionally, I will note that the add-apt-repository command, when resolving a PPA URL in the short format, makes queries out to the Launchpad APIs. If these are restricted by local network settings, those queries will fail and the target URLs for adding the apt-source strings cannot be resolved. This is something that can, and should be, handled by the slurm charms here.
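For reference, the proxy-related model options can be inspected and set roughly like this (a sketch; the proxy URL is a placeholder):

juju model-config | grep proxy
juju model-config juju-http-proxy=http://proxy.example:3128 juju-https-proxy=http://proxy.example:3128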

Additionally, the error reported by the mysql-router/0 unit indicates that the charmed-mysql snap is already installed. The installation hook is fairly simple; its primary purpose is to install the snap itself. However, perusing the code here, I can see a scenario where the asynchronous task of installing the snap fails to complete before the charm logic times out the installation, ultimately causing the hook to error out. Upon re-running the installation hook, the mysql-router charm uses a simple check to see if the snap is already installed, but it doesn't account for the snap already being installed when re-executing after a failed install hook. This may or may not be by design, but I'll raise a bug against the mysql-router charm to handle this scenario and let them weigh in on it (cc @taurus-forever).

The stack traces provided don't capture the mysql-operator logs themselves. However, the juju status output in the 3rd screenshot clearly indicates that the mysql service could not be configured, which matches the original condition. Ultimately, I think there are a few things that need to be cleared up here, but the data provided doesn't help identify what the problem is in the mysql-operator case.

@NucciTheBoss Hi Jason. Thank you for your message and your work. I just want to tell you that on this platform I have 15 bundles that have always worked: the Sunbeam bundle, the OpenStack bundle, MicroK8s, MicroCeph, Charmed-k8s, Kubeflow, MLflow, COS, PostgreSQL, MySQL... Even the old HPC bundle on focal worked, so you understand what I mean. This is the only bundle that no longer deploys since Percona was replaced with MySQL 8: one bundle out of 15 that is no longer deployable on my platform of 25 Asus physical servers!!! Even the mysql bundle doesn't deploy on its own; I reported the bug to the mysql bundle team some time ago, though to tell you the truth I managed to deploy it charm by charm. Another thing: I have been testing since yesterday on juju-controller 2.9 and juju-controller 3.1, but the problem is the same. I am still testing right now.
[Screenshots 2023-10-25 01-26-53, 01-24-55, 01-24-37, 01-24-10, and 01-23-42]

@NucciTheBoss The IP of the mysql node is fixed at the MAAS level: 192.168.70.130 and 172.200.0.130.
[Screenshot 2023-10-25 01-47-10]

I restarted the deployment live, but the issue is still the same:
[Screenshots 2023-10-25 01-51-37 and 2023-10-25 01-51-10]

> Looking at the error reported from the slurmd/0 unit ... I am curious whether there's a network proxy in the way. Can you comment on whether this is the case or not?

Hi @wolsen. No, it was just an oversight in this image. Since then I have tried again several times, including the current live run.

@moula thank you for refreshing your deployment; that clears up some of my initial confusion. Can you ssh into each of the nodes to ensure that they can access the external network?

For example, could you try contacting ubuntu.com from mysql/0:

juju ssh mysql/0
ping ubuntu.com

Can you then verify that mysql/0 can reach slurmdbd/0?

ping <ip_for_slurmdbd/0>

I just want to ensure that your nodes can reach the network. This will help narrow down where the issue is located within the cluster.

@NucciTheBoss I have another error message after rebooting the nodes, as requested during the installation of the jammy OS.

[Screenshot 2023-10-25 02-26-25]

@NucciTheBoss I have a question:
After modifying the IPs of the nodes in /etc/cloud/templates/hosts.debian.tmpl and restarting the nodes, are there any commands to force the bundle deployment to re-execute?
moula@maas:~/slurm-bundles$ juju deploy ./slurm-core/bundle.yaml --overlay ./slurm-core/charms/latest-edge.yaml --force
No changes to apply.

@moula

The issue is on /etc/hosts indeed. For some reason, my vSphere VM comes with the following line:

127.0.1.1 juju-39c062-25 juju-39c062-25

You have the same thing.

I have tested pre-adding the machine, commenting that line, rebooting the machine, and deploying the charm on it after that.

It deployed successfully!

I need to make the changes in /etc/cloud/templates/hosts.debian.tmpl so that they persist after a reboot.
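For reference, the non-persistent part of the workaround is just commenting out the loopback line (a sketch using the hostname from the vSphere VM above):

# /etc/hosts
#127.0.1.1 juju-39c062-25 juju-39c062-25

The same line also has to be removed from /etc/cloud/templates/hosts.debian.tmpl, the cloud-init template that regenerates /etc/hosts on boot when manage_etc_hosts is enabled, for the change to survive a reboot.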

Thanks @jaimesouza for testing this out. This can be a stop-gap patch in the short term; however, we should identify what the root cause of the issue is.

We're looking at what the issue could possibly be within the MySQL operator. It's obvious now that the nodes in both your clouds are being assigned IPs via DHCP and able to contact the network, so we need to figure out why the MySQL operator is resolving the IP to be 127.0.1.1. Don't want to patch /etc/hosts if we don't have to 😅

> After modifying the IPs of the nodes in /etc/cloud/templates/hosts.debian.tmpl and restarting the nodes, are there any commands to force the bundle deployment to re-execute?

I don't think so. Generally you shouldn't need to re-execute the bundle deployment, as that goes against the autonomous nature of Juju. Juju will automatically attempt to rerun failed hooks.

If you have failed hooks in your cluster, you can follow the documentation here for how to retrigger the hook: https://juju.is/docs/sdk/debug-a-charm#heading--debug-a-single-failing-hook. This documentation is aimed at debugging, but you can use it to rerun hooks should you need to!
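For a failed hook, the retry itself looks something like this (a sketch; mysql/0 used as the example unit):

# re-run the failed hook on the unit
juju resolved mysql/0
# or mark it resolved without retrying the hook
juju resolved --no-retry mysql/0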

> I have tested pre-adding the machine, commenting that line, rebooting the machine, and deploying the charm on it after that. It deployed successfully!

@jaimesouza Thanks.
I did the same thing with the mysql node, adding the fixed IP to it once the Ubuntu OS was installed and rebooted, but for the moment it doesn't work, hence my last question to @NucciTheBoss. I hope that with your modifications it will work. They are also making changes to the mysql-router charm.
Thank you to all of you.

Hi guys. I think our bug will be fixed soon. I will test it as soon as it is applied. Thanks.
canonical/mysql-router-operator#84.
Moula.