jackfrancis/kamino

In Azure, prototype-based VMs sometimes trigger an IDNS delete of their donor name

Closed this issue · 7 comments

When a new VM starts, it learns its name via DHCP, but sometimes that triggers a code path to delete the "old" name: the name that the donor VM had when the prototype image was created.

I have yet to find the key item that prevents this from happening, other than the "full genericization" of the VM. That, however, requires cloud-init and the CSEs to rerun, and they don't run correctly because the machine looks like it has already run parts of them.

The current workaround to this problem is to sacrifice the donor node. This means that the name no longer exists and thus is never seen as something that could be deleted in IDNS.

How does a new VM derive this "old name" when it starts? It can't be from a reverse lookup, because DHCP will have given the VM a new, available IP address, for which there should be no PTR record.

Is the "old name" on disk in the OS layer somewhere? Or is there some other unique identifier dictionary in the Azure platform that knows about the "old name" that a derived VM inherits?

The old name is cached on the disk in a number of locations. This is what seems to trigger the issue, but we have not yet found the right item to fix up without causing another problem. (I have one guess, but the issue does not reproduce 100% of the time, so testing it has been tricky and time consuming.)
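As a starting point for narrowing down which cached copy matters, a scan along these lines could list every file that still mentions the donor name. This is only a sketch: the function name `find_hostname_references`, the `max_bytes` cap, and the example root and donor name are all hypothetical, and the thread does not say which locations are actually involved.

```python
import os


def find_hostname_references(root, old_name, max_bytes=1_000_000):
    """Return sorted paths of regular files under root that contain old_name.

    Hypothetical helper: it only reports candidates, it changes nothing.
    """
    needle = old_name.encode("utf-8")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # Skip symlinks and anything that is not a regular file
            # (sockets, device nodes, FIFOs).
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    # Read at most max_bytes so large files stay cheap to scan.
                    data = handle.read(max_bytes)
            except OSError:
                # Unreadable file (permissions, vanished mid-scan): ignore it.
                continue
            if needle in data:
                hits.append(path)
    return sorted(hits)
```

Running something like `find_hostname_references("/etc", "donor-vm-name")` on a freshly started clone (both arguments are placeholders) would give a candidate list to test one file at a time. Only the first `max_bytes` of each file are searched, so a match deeper in a large file would be missed.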

With the new fix needed for the Machine ID, the workaround for the IDNS issue just became a required feature for the system as a whole.

The only other option would be to copy the disk to a new VM, start it, tweak it, and then end it (use that VM just for the tweak operations) and then the original donor can remain. I am not sure that this is worth the effort yet given the amount of time and failure paths that introduces.

We should investigate whether the machine-id was related to the IDNS issue. I doubt it, but it would be good to know for sure.
In any case, if we can find the magic file to clean up for this, we could resolve this issue using the same technology used to resolve #22. That would be nice, as I don't like the need for the sacrifice (although I also don't mind it).

I would prefer we drive this issue to resolution and not sacrifice the node.

I agree, but it should not be a blocker for 1.0.0, although we may have some time to find the minimal change needed to solve this. That is the trick: I can solve it with a big change, but that requires cloud-init to re-init the node again (which is one of the things we are trying to eliminate).

The problem was tricky, but it is now solved: we get correct hostnames, and nothing asks to delete the donor's hostname anymore. (The image no longer knows it had a prior hostname, so it does not try to delete it.)

See PR #69