Anti-affinity rule doesn't work
Closed this issue · 6 comments
I tried to implement an anti-affinity rule but it doesn't work. The tag is identified, for example plb_exclude_fw on 2 VMs on the same host, but when I run the script, it doesn't move the VMs:
<6> ProxLB: Info: [logger]: Logger verbosity got updated to: INFO.
<4> ProxLB: Warning: [api-connection]: API connection does not verify SSL certificate.
<6> ProxLB: Info: [api-connection]: API connection succeeded to host: 10.99.99.10.
<6> ProxLB: Info: [only-on-master-executor]: No master only rebalancing is defined. Skipping validation.
<6> ProxLB: Info: [node-statistics]: Added node Proxmox-3AZ-1-A.
<6> ProxLB: Info: [node-statistics]: Added node Proxmox-3AZ-1-B.
<6> ProxLB: Info: [node-statistics]: Added node Proxmox-3AZ-1-C.
<6> ProxLB: Info: [node-statistics]: Created node statistics.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM/CT tag from API.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM/CT tag from API.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM/CT tag from API.
<6> ProxLB: Info: [api-get-vm-include-exclude-tags]: Got PLB exclude group.
<6> ProxLB: Info: [vm-statistics]: Added vm FW2.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM/CT tag from API.
<6> ProxLB: Info: [api-get-vm-include-exclude-tags]: Got PLB exclude group.
<6> ProxLB: Info: [vm-statistics]: Added vm FW1.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM/CT tag from API.
<6> ProxLB: Info: [vm-statistics]: Added vm Test-VM-2.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM/CT tag from API.
<6> ProxLB: Info: [vm-statistics]: Added vm VM-Test-1.
<6> ProxLB: Info: [api-get-vm-tags]: Got VM/CT tag from API.
<6> ProxLB: Info: [vm-statistics]: Added vm VM-Test-3.
<6> ProxLB: Info: [vm-statistics]: Created VM statistics.
<4> ProxLB: Warning: [node-update-statistics]: Node Proxmox-3AZ-1-B is overprovisioned for disk by 164%.
<4> ProxLB: Warning: [node-update-statistics]: Node Proxmox-3AZ-1-B is overprovisioned for disk by 328%.
<4> ProxLB: Warning: [node-update-statistics]: Node Proxmox-3AZ-1-B is overprovisioned for disk by 492%.
<4> ProxLB: Warning: [node-update-statistics]: Node Proxmox-3AZ-1-B is overprovisioned for disk by 656%.
<4> ProxLB: Warning: [node-update-statistics]: Node Proxmox-3AZ-1-B is overprovisioned for disk by 820%.
<6> ProxLB: Info: [node-update-statistics]: Updated node resource assignments by all VMs.
<6> ProxLB: Info: [balancing-method-validation]: Valid balancing method: memory
<6> ProxLB: Info: [balancing-mode-validation]: Valid balancing method: used
<6> ProxLB: Info: [balanciness-validation]: Rebalancing for memory is not needed. Highest usage: 93% | Lowest usage: 87%.
<6> ProxLB: Info: [rebalancing-tags-group-exclude]: Create exclude groups of VM hosts.
<6> ProxLB: Info: [rebalancing-calculator]: Balancing calculations done.
<6> ProxLB: Info: [rebalancing-executor]: No rebalancing needed.
<6> ProxLB: Info: [cli-output-generator]: Start rebalancing vms to their new nodes.
<6> ProxLB: Info: [cli-output-generator]: No rebalancing needed.
<6> ProxLB: Info: [post-validations]: All post-validations succeeded.
<6> ProxLB: Info: [daemon]: Running in daemon mode. Next run in 1 hours.
I looked at your code and noticed that in the method __get_vm_tags_exclude_groups, you use group_include instead of group_exclude:
def __get_vm_tags_exclude_groups(vm_statistics, node_statistics, balancing_method, balancing_mode):
    """ Get VMs tags for exclude groups. """
    info_prefix = 'Info: [rebalancing-tags-group-exclude]:'
    tags_exclude_vms = {}
    processed_vm = []

    # Create groups of tags with belongings hosts.
    for vm_name, vm_values in vm_statistics.items():
        if vm_values.get('group_include', None):
            if not tags_exclude_vms.get(vm_values['group_include'], None):
                tags_exclude_vms[vm_values['group_include']] = [vm_name]
            else:
                tags_exclude_vms[vm_values['group_include']] = tags_exclude_vms[vm_values['group_include']] + [vm_name]
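For comparison, a minimal standalone sketch of how the grouping could look with the exclude key (assuming the anti-affinity tag ends up under a group_exclude key in vm_statistics; this mirrors the quoted loop and is not the project's actual code):

# Sketch only: group VM names by their exclude (anti-affinity) tag.
# The key name 'group_exclude' is an assumption based on the log prefix;
# only the lookup key differs from the quoted loop above.
def build_exclude_groups(vm_statistics):
    """Group VM names by their exclude (anti-affinity) tag."""
    tags_exclude_vms = {}
    for vm_name, vm_values in vm_statistics.items():
        group = vm_values.get('group_exclude')
        if group:
            tags_exclude_vms.setdefault(group, []).append(vm_name)
    return tags_exclude_vms

# Example usage with dummy statistics:
vms = {
    'FW1': {'group_exclude': 'plb_exclude_fw', 'node_parent': 'Proxmox-3AZ-1-B'},
    'FW2': {'group_exclude': 'plb_exclude_fw', 'node_parent': 'Proxmox-3AZ-1-B'},
    'Test-VM-2': {},
}
print(build_exclude_groups(vms))  # {'plb_exclude_fw': ['FW1', 'FW2']}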
Could you please check if the script works for you?
Regards.
Hey @adminsyspro,
this didn't work because no rebalancing has been executed:
<6> ProxLB: Info: [balanciness-validation]: Rebalancing for memory is not needed. Highest usage: 93% | Lowest usage: 87%.
<6> ProxLB: Info: [rebalancing-tags-group-exclude]: Create exclude groups of VM hosts.
<6> ProxLB: Info: [rebalancing-calculator]: Balancing calculations done.
<6> ProxLB: Info: [rebalancing-executor]: No rebalancing needed.
This action only proceeds alongside a rebalance, i.e. when there's a reason to rebalance VMs. I would assume that the VMs are in general already placed distinctly from each other, and ProxLB honours that when performing a rebalance action.
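To illustrate the gating step, a simplified sketch (not ProxLB's actual code) of the balanciness check that decides whether a rebalance run - and with it the anti-affinity handling - happens at all; the threshold value of 10 is just an assumption for illustration:

# Simplified sketch: rebalancing only runs when the spread between the
# busiest and the least busy node exceeds the configured balanciness.
def rebalancing_needed(node_usage_percent, balanciness=10):
    """Return True when the usage spread justifies a rebalance run."""
    highest = max(node_usage_percent.values())
    lowest = min(node_usage_percent.values())
    return (highest - lowest) > balanciness

# With the values from the log (93% vs. 87%) the spread is only 6,
# so no rebalance - and no anti-affinity move - is triggered.
print(rebalancing_needed({'A': 93, 'B': 87, 'C': 90}))  # False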
At first look you seem to be right that there might be a typo - wondering how the tests passed. Will have a look at it asap. Thanks for reporting!
Cheers,
gyptazy
Thanks for your fast reply :)!
Oh, I understand. Even if I have the tag plb_exclude_fw applied to my two VMs, if the resource scheduler doesn't detect that it needs to move VMs because everything is well balanced, it doesn't apply the anti-affinity rule immediately?
However, I tried the opposite with an affinity rule, and it works immediately at start to group my VMs. So for anti-affinity it doesn't work the same way?
Regards.
Hey @adminsyspro,
that might have been luck in that case: due to creating an affinity group, the gap between the resources was maybe too large and resulted in a rebalancing, while with this bug nothing changed and therefore all resources were still within the given balanciness range.
In general, this is only done when a rebalancing also occurs. Now, we could also think about changing the priorities, because currently I only want to honour such things during a rebalance action - but it would also be legit to say I want to enforce the affinity/anti-affinity rules (as a primary goal) in addition to rebalancing (secondary goal).
Maybe I can implement something like this as a new option (e.g. primary_action: affinity or balancing).
The reason I don't want to treat this as the default is that such groups may have an impact on the whole cluster and result in bigger movements. But as always, that heavily depends on the setup and what a user wants. So I think an option for this might be the best approach for everyone. What do you think?
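A rough illustration of how such a hypothetical primary_action option could steer the run order (the option and the helper names are made up for this sketch and do not exist in ProxLB):

# Purely illustrative sketch - 'primary_action' is a hypothetical option;
# the two callables stand in for the real steps.
def run_cycle(primary_action, enforce_affinity_rules, rebalance_if_needed):
    """Run one cycle, honouring the configured priority."""
    if primary_action == 'affinity':
        # Enforce (anti-)affinity groups first, even without a resource
        # reason, then rebalance on top of that placement.
        enforce_affinity_rules()
        rebalance_if_needed()
    else:
        # 'balancing' (current behaviour): affinity rules are only honoured
        # as part of a rebalance run, so nothing moves if the cluster is
        # already within the balanciness range.
        rebalance_if_needed()

# Example usage with no-op placeholders:
run_cycle('affinity', lambda: print('enforce groups'), lambda: print('rebalance'))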
Cheers,
gyptazy
Indeed, I understand the concept.
It would be great if we could choose primary_action like you said!
Your tool is great! Keep improving it ;)
Just about pve-proxmoxlb-service-ui, how can I install it? I can't find a procedure to install it on my cluster.
Thank you!
The Proxmox UI integration is currently discontinued and will be re-integrated at a later time (the reason can be found here: #44).
Hey @adminsyspro,
I added a fix with PR #68 - maybe you can give it a try. Take care, this is based on the new upcoming 1.0.3 release and requires the newly introduced config (already present in the main branch).
It now evaluates all VMs/CTs assigned to an anti-affinity group and adds their nodes to a list. For each VM/CT it then gets a random node and validates whether that node is already present for the evaluated anti-affinity group and its VMs/CTs. If the node is already used, it tries a new random one until it finds one. If one is found, it is added to the list to make sure the next VM evaluation won't also use it. This is repeated for up to a maximum of 30 tries. If we have more VMs assigned than nodes present, we need to deal with that somewhere.
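Roughly, my reading of that selection as a simplified standalone sketch (not the PR's actual code; names like node_parent and the function signature are assumptions):

import random

# Sketch of the described selection: for each VM/CT in an anti-affinity
# group, pick a random node that no other group member already uses,
# retrying up to 30 times before giving up for that VM.
def assign_anti_affinity_nodes(group_vms, vm_statistics, node_names, max_tries=30):
    """Return {vm_name: proposed_node} so group members avoid shared nodes."""
    # Start with the nodes the group members currently run on.
    used_nodes = [vm_statistics[vm].get('node_parent') for vm in group_vms]
    proposals = {}
    for vm_name in group_vms:
        for _ in range(max_tries):
            candidate = random.choice(node_names)
            if candidate not in used_nodes:
                used_nodes.append(candidate)  # reserve it for this group
                proposals[vm_name] = candidate
                break
        # If the group has more VMs than available nodes, some VMs simply
        # get no proposal and keep their current placement.
    return proposals

# Example with dummy data (both firewalls currently share node-b):
stats = {'FW1': {'node_parent': 'node-b'}, 'FW2': {'node_parent': 'node-b'}}
print(assign_anti_affinity_nodes(['FW1', 'FW2'], stats, ['node-a', 'node-b', 'node-c']))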
Happy to hear if this fixes your issue.
Thanks,
gyptazy