jenkinsci/openstack-cloud-plugin

Launch several virtual machines simultaneously

Opened this issue · 9 comments

Hi, @olivergondza

We encountered the following problem while using your plugin: new virtual machines are launched one by one, even when build queue is full.

Steps to reproduce:

  • Two templates of slaves: CENTOS & REDHAT
  • Settings are set in "Default slave options"
  • Max. No. of Instances = 50
  • Min. No. of Instances = 3
  • Launch 15 long-running jobs

Observed behaviour:

  • 6 job start on existing agents
  • 9 jobs are hanging in the queue
  • new agents are launched one-by-one, new vm is created only after a previous one is complete

What we want:

  • some parameter, which allows us to set up maximum number of agents to be launched simultaneously
    or
  • the way to configure such behaviour by some combination of existing parameters

Below are the logs showing that creation of 3 instances is planned, but they are launched one-by-one.

2020-02-18 08:26:51.501+0000 [id=11476] INFO j.p.o.c.JCloudsPreCreationThread#execute: Pre-creating 3 instance(s) for template rhel7 in cloud cloud.xxx.com

2020-02-18 08:26:51.513+0000 [id=11476] INFO j.p.o.c.JCloudsSlaveTemplate#provisionServer: Provisioning new openstack server rhel7-162 with options jenkins.plugins.openstack.compute.SlaveOptions@517f51dd[...]

2020-02-18 08:27:21.872+0000 [id=11476] INFO j.p.o.compute.JCloudsSlave#createComputer: Creating a new computer for rhel7-162

2020-02-18 08:27:21.886+0000 [id=11476] INFO j.p.o.c.JCloudsSlaveTemplate#provisionServer: Provisioning new openstack server rhel7-163 with options jenkins.plugins.openstack.compute.SlaveOptions@5723a58f[...]

2020-02-18 08:27:54.541+0000 [id=11476] INFO j.p.o.compute.JCloudsSlave#createComputer: Creating a new computer for rhel7-163

2020-02-18 08:27:54.561+0000 [id=11476] INFO j.p.o.c.JCloudsSlaveTemplate#provisionServer: Provisioning new openstack server rhel7-164 with options jenkins.plugins.openstack.compute.SlaveOptions@457210bd[...]

Thanks in advance.

It is strange you cannot get to anything faster than one machine per ~30 seconds. The heavy lifting is supposed to be done in parallel so there either is something that causes that to be dispatched in quasi-serial way (which is fixable in plugin) or Jenkins heuristics for node planning are not requesting the capacity any faster than once per ~30 seconds (which is hard to fix).

Can you attach "Remaining excess workload" logs from NodeProvisioner and "Provisioning slave for" from JCloudsCloud (Fine logging needs to be turned on)? If NodeProvisioner is requesting capacity one by one, we know it is the core.

[1]

LOGGER.fine("Provisioning slave for " + label + " from template " + template.getName());

@olivergondza, thank you for quick answer!
We turned on 'FINE' logging level but unfortunately can't see the lines you mentioned in our logs.
Please check attached file.

provisioning.log

Hmm, what version of the plugin are you using? The line is there since 2.46...

We currently have version 2.51

@olivergondza, today we installed version 2.52-SNAPSHOT (private-4acc8024-jenkins).
After that we repeated experiment and finally caught that message, but only once and it looked so:

Provisioning slave for null from template rhel7

You can see it in attached XML: log.xml.txt

When we tried once more - there didn't see this message anymore.
Maybe we have configured something wrong? If it is important, we used web interface, not jcasc.

@Ezhvsalate, I see no logs from the NodeProvisioner there. In order to do a reasonable measurement, it is important those machines are scheduled based on queue demand. Then we have to evaluate whether the plugin invoked one node at a time or it is slow provisioning the bulk.

@olivergondza, ok, attached log with NodeProvisioner below.
provisioning_with_node_prov.log

@Ezhvsalate, I am sorry, but the log only contain one slave provisioning attempt (we need a lot more to assess why they are not dispatched fast when in bulk). I have retested in my env, and I have witnessed sub-second delays between "Provisioning new openstack server..." messages (meaning the servers are requested lot fasted than for you) and the "Provisioned: NovaServer..." interleaved with the previous ones confirming machines are provisioned in parallel. So in order to investigate in your environment, it is needed to:

  • Eliminate available Jenkins nodes not to interfere with the measurement.
  • Configure the UI logging like this[1].
  • Rapidly schedule at least 10 items in the queue that run on openstack provided nodes.
  • Wait until all nodes are running the builds.
  • Attach the entire log produced (this cannot be postponed as the log will be rotated eventualy)

Then we shall see.

[1]
screenshot

@olivergondza, when I carefully followed your instructions above it was ok - agents were starting simultaneously. Here is the log:
openstack_fast_creating_agents.log

But after that I returned back "Min. No. of Instances" setting and the issue was back :)

So, the following steps should be completed to to reproduce issue:

  1. Set the "Min./Max. No. of Instances" settings in "Default slave options" section to "3 / 50":
    image
  2. Wait until 3 agents (in our case -6, because there are two different types) are created one-by-one.
  3. Launch 15 builds. 6 will be started, 9 will fall in queue.
  4. Observe one-by-one creation of new agents:
    Openstack-one-by-one-creation.log