mwrock/kitchen-nodes

Kitchen converge race condition

jeremyciak opened this issue · 10 comments

I know this repository hasn't seen contributions in a long time, and I have adopted scalp42's fork myself due to compatibility issues, but I'm hoping that posting this issue gets some visibility from someone who may have suggestions or answers.

What I am experiencing has to do with platform composition and the race condition that manifests because of it. I am running a Test Kitchen suite with 1 Windows Server 2016 node and 3 Windows Server 2019 nodes. For whatever reason, the Windows Server 2016 node blasts through the first stages of a kitchen converge and reaches the "Preparing nodes" step before any of the Windows Server 2019 nodes have populated any node data. This causes any node search from the Windows Server 2016 node to fail, since there is no node data for the other nodes yet.

My current "solution" is simply to run the converge once, wait for it to fail, and then run converge again. I had not hit this race condition before, and I had always assumed that the node data was populated at the end of the kitchen create action rather than at the start of the kitchen converge action. I am assuming that with my "solution" the first converge populates all of the node data and fails, and the next converge then succeeds.
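The two-pass workaround described above can be sketched in Ruby roughly as follows. This is only an illustration: `converge_with_retry` and its `runner` parameter are hypothetical names, and the only real command involved is `kitchen converge`. The runner is injectable so the logic can be exercised without Test Kitchen installed.

```ruby
# Hypothetical wrapper around the "converge, let it fail, converge again"
# workaround. Not part of Test Kitchen or kitchen-nodes.
def converge_with_retry(runner: ->(cmd) { system(cmd) })
  # First pass: may fail on the node that wins the race, but it is assumed
  # to leave node data behind for every instance.
  return true if runner.call('kitchen converge')

  warn 'first converge failed; retrying now that node data should exist'

  # Second pass: its result is the one that counts.
  runner.call('kitchen converge') == true
end
```

Calling `converge_with_retry` with no arguments would shell out to `kitchen converge` up to two times and return `true` only if one of the passes succeeds.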

Is there a way to update test kitchen code anywhere/anyhow so that the node data is populated at the end of the kitchen create action and not at the start of the kitchen converge action?

When this happens the race condition manifests itself:

image

When this happens everything works:

image

I'll try to take a look next week @jeremyciak

@scalp42 Awesome! ...and thank you! I have rudimentary ability to read and write Ruby code, but I have no idea how to do underlying Chef code development to test whether anything I would write is actually functional in that realm. If you're able to point me in the right direction there I would love to help out.

@scalp42 Please let me know if/when you get a chance to assist here. I'm desperate!

Looking at it, this doesn't seem to be possible, because some of the data, such as the ipaddress or recipes, is only discovered at converge time.

That being said, I believe you should be able to create the JSON files in advance.

Is it not working?

What do you mean about creating the JSON files in advance? You mean manually define them? Or you mean basically implement what this provisioner is doing in an out-of-band method?

And what are you referring to specifically not working?

So I believe it's working as intended. This plugin does it both ways.

Here's an example on Ubuntu 18.04 with a simple recipe:

suites:
  - name: node1
    run_list:
      - recipe[example::search]

# example::search recipe
tag 'jeremyciak'

nodes = search(:node, 'tags:jeremyciak').sort

if nodes.empty?
  Chef::Log.info %|#{cookbook_name} => Could not find node matching "tags:jeremyciak".|
else
  Chef::Log.info %|#{cookbook_name} => The following nodes were found:|
  nodes.uniq.each { |n| Chef::Log.info "- #{n.name} (ip: #{n['ipaddress']})" }
end

Now if I just run converge, it won't find any node matching tags:jeremyciak.

But if I drop in a simple JSON file with whatever is needed to make it work (here we just care about the tags attribute at the normal level):

// Place this file under cookbook_name/test/nodes/node1-ubuntu-1804.json
{
  "name": "node1-ubuntu-1804",
  "chef_environment": "kitchen",
  "normal": {
    "run_completed": false,
    "tags": [
      "jeremyciak"
    ]
  }
}

Now if I destroy and converge again:

       Compiling Cookbooks...
       [2020-09-08T18:31:55+00:00] INFO: example => The following nodes were found:
       [2020-09-08T18:31:55+00:00] INFO: - node1-ubuntu-1804 (ip: )
       Converging 0 resources
       [2020-09-08T18:31:55+00:00] INFO: Chef Infra Client Run complete in 0.344096343 seconds

Notice that the ipaddress was not found as we didn't specify it in the JSON.

But if you run converge again, kitchen-nodes will then populate the missing data needed for other nodes or anything you want.
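For a suite with several instances, the stub files shown above could be generated rather than written by hand. The following is a minimal sketch, assuming the same `test/nodes` location and the same stub shape as the example; `write_node_stubs` is a hypothetical helper, not something kitchen-nodes provides.

```ruby
require 'json'
require 'fileutils'

# Hypothetical helper: writes a minimal node JSON stub for each instance
# name into nodes_dir. Existing files are left alone so that data written
# by an earlier converge is not overwritten.
def write_node_stubs(instance_names, nodes_dir:, tags: [])
  FileUtils.mkdir_p(nodes_dir)
  instance_names.map do |name|
    path = File.join(nodes_dir, "#{name}.json")
    next path if File.exist?(path)

    stub = {
      'name' => name,
      'chef_environment' => 'kitchen',
      'normal' => { 'run_completed' => false, 'tags' => tags }
    }
    File.write(path, JSON.pretty_generate(stub))
    path
  end
end
```

For the example above this would be called as `write_node_stubs(%w[node1-ubuntu-1804], nodes_dir: 'test/nodes', tags: %w[jeremyciak])`. As with the hand-written stub, attributes such as the ipaddress are not included and would only appear after a converge.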

Yes, over my time troubleshooting this I have become intimately familiar with how this provisioner should operate. What I am wondering is whether we can fix the requirement that node data for all relevant nodes be generated before any node pulls down the generated node data. I am seeing a race condition where one of my nodes generates its node data and pulls it down before any of the other nodes have generated theirs. This results in that node being unable to find any of the other nodes. I'm wondering if we can add some kind of context awareness that waits until all nodes have produced their node data before continuing the converge actions. I don't know whether this can live within this provisioner or whether it would have to be implemented somewhere in the Test Kitchen code.
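To make the "wait until all nodes have produced their node data" idea concrete, here is a rough sketch of a polling barrier in plain Ruby. It is not existing kitchen-nodes or Test Kitchen behavior, and where such a check would be hooked in is exactly the open question. `wait_for_node_files` is a hypothetical helper, and it assumes one `<instance-name>.json` file per node in a shared directory.

```ruby
require 'json'

# Hypothetical barrier: polls nodes_dir until every expected instance has a
# parseable JSON file, or until the timeout elapses. Returns the names that
# are still missing (an empty array means all node files are present).
def wait_for_node_files(instance_names, nodes_dir:, timeout: 300, interval: 5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    missing = instance_names.reject do |name|
      path = File.join(nodes_dir, "#{name}.json")
      begin
        File.exist?(path) && JSON.parse(File.read(path)).is_a?(Hash)
      rescue JSON::ParserError
        false # file is still being written or is malformed
      end
    end
    return missing if missing.empty?
    return missing if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end
```

A caller would treat a non-empty return value as a timeout and decide whether to abort or continue. The check only confirms that each file exists and parses; verifying specific attributes such as an IP address would need an additional condition.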

The use case I have is to orchestrate nodes for a Microsoft Windows Remote Desktop infrastructure deployment (RD Broker(s), RD Gateway(s), RD Web Server(s), RD Host(s)). I need accurate networking data populated so that these nodes can be referenced with PowerShell/DSC and a deployment created from them. The other issue is that our development and CI environments use different networks, so I rely on DHCP to hand out the IP addresses and then reference those IPs in the node data that this provisioner lets me produce dynamically. I can't manually specify networking-related node data or it will break either my development process or my CI process.

Unfortunately, I'm not sure I can help more with your environment. Someone else might be able to chime in.

In general, if you really have to rely on timing, you could also sleep in your code with a number of retries, assuming that the node data will be there eventually (for a search, for example).
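A minimal sketch of that sleep-and-retry idea, assuming the data the search depends on can actually appear while the recipe is running. `retry_search` is a hypothetical helper written in plain Ruby, and the attempt count and delay are arbitrary.

```ruby
# Hypothetical helper: calls the block until it returns a non-empty result
# or the attempts run out. Returns the result, or an empty array on failure.
def retry_search(attempts: 10, delay: 15)
  attempts.times do |i|
    result = yield
    return result unless result.nil? || result.empty?

    # Do not sleep after the final attempt.
    sleep delay if i < attempts - 1
  end
  []
end
```

Inside a recipe this could wrap the earlier search, for example `nodes = retry_search { search(:node, 'tags:jeremyciak') }`.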

Yeah, my recipes already have a bunch of retries and waits to account for other orchestration stuff. The issue here is that this problem manifests before my recipes are even relevant so I have no control over this from within my Chef recipe code.