metal3-io/baremetal-operator

Provision profiling

sebastiaopamplona opened this issue · 6 comments

User Story

As an operator, I would like to profile the provisioning process for benchmarking purposes.

Detailed Description

I'm using qcow2 images: one is ~600 MB and another is ~2.2 GB. The first one provisions in roughly 3 minutes, while the second takes about 20 minutes. The provisioning time is clearly not proportional to image size, although I'm aware there are image details that can affect provisioning speed.

Anything else you would like to add:

  1. Is there a way to profile the provision process?
  2. What are the criteria for moving a server to the provisioned state?

/kind feature

This issue is currently awaiting triage.
If Metal3.io contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.


Rozzii commented

Answer to question one:
There isn't a straightforward way, because provisioning spans several stages:

  • it can involve two machine restarts
  • multiple IPA workflows
  • cloud-init configuration actions after the machine has rebooted into the node image
  • the kubeadm registration process

Because of the multiple restarts, there is no permanent agent (e.g. a daemon) that could measure the time taken by all of the provisioning activities.
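
Since nothing persistent survives those restarts on the host itself, the closest thing to profiling today is to timestamp the state transitions that BMO exposes on the BareMetalHost resource from the outside. Below is a minimal sketch, assuming the metal3.io/v1alpha1 BareMetalHost CRD and using placeholder namespace/host names; it only gives you wall-clock time per BMO-visible state, not per Ironic deploy step.

```python
# Coarse provisioning "profiler" run outside the host: watch a BareMetalHost
# and print how long it spent in each provisioning state.
import time

from kubernetes import client, config, watch

config.load_kube_config()
api = client.CustomObjectsApi()

NAMESPACE = "metal3"   # placeholder
HOST_NAME = "my-bmh"   # placeholder

last_state = None
last_ts = time.time()

w = watch.Watch()
for event in w.stream(
    api.list_namespaced_custom_object,
    group="metal3.io",
    version="v1alpha1",
    namespace=NAMESPACE,
    plural="baremetalhosts",
):
    obj = event["object"]
    if obj.get("metadata", {}).get("name") != HOST_NAME:
        continue
    state = obj.get("status", {}).get("provisioning", {}).get("state")
    if state != last_state:
        now = time.time()
        print(f"{last_state} -> {state} after {now - last_ts:.0f}s")
        last_state, last_ts = state, now
```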

On the other hand, Ironic provides a Prometheus endpoint (I have never used it), so you might be able to get some data from that.
Another possibility is to log into the node via SSH while the IPA ramdisk is executing the deployment steps and look at the logs to see whether there is any issue.
Ironic also has an automatic log collection option for deployments.
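
If you want to poke at that Prometheus endpoint, something like the snippet below will dump the Ironic-related series so you can see whether anything useful for deploy timing is published. This is only a sketch: the exporter URL and port depend entirely on how Ironic is deployed in your cluster, so METRICS_URL here is a made-up placeholder.

```python
# Dump Ironic-related Prometheus series from the metrics endpoint.
import requests

# Hypothetical endpoint; replace with wherever your Ironic exporter listens.
METRICS_URL = "http://ironic-metrics.example.com:9608/metrics"

resp = requests.get(METRICS_URL, timeout=10)
resp.raise_for_status()

for line in resp.text.splitlines():
    if line.startswith("#"):        # skip HELP/TYPE comment lines
        continue
    if "ironic" in line:
        print(line)
```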

Answer to the second question: the machine has to boot after a successful Ironic provisioning process and register itself with the K8s cluster.

Thanks for the reply @Rozzii; I have a few follow-up questions:

Another possibility is to log into the node via SSH while the IPA ramdisk is executing the deployment steps and look at the logs to see whether there is any issue.

Which steps does the IPA ramdisk execute to start image provisioning?

The machine has to boot after a successful Ironic provisioning process and register itself with the K8s cluster.

What do you mean by register itself with the K8s cluster? Say I'm using CentOS 9 for the IPA ramdisk and I provision Ubuntu 22.04, and the server reaches the Provisioned state successfully. Does the server send anything to the BMO from the Ubuntu 22.04 side?

What would you recommend to speed up provisioning times?


Sorry if this is documented somewhere; I did not find it. If there are docs that answer these questions, could you please point me to them? Thanks!

  1. From the BMO perspective it is the "provisioning" step; from the Ironic perspective it is the "deploy" step.

  2. From the BMO perspective, "provisioning" is done after IPA reports a successful "deploy step" back to Ironic, but from the CAPI perspective a "Machine" is provisioned successfully only when it joins the K8s cluster. Don't forget we are running the CAPI kubeadm bootstrap provider, which will run kubeadm and other scripts with the help of cloud-init after IPA has written the image to disk and restarted the machine (see the sketch after the list below for a rough way to time that phase).

  • Use the minimal number of physical disks in the machine to speed up "pre-deployment" cleanup
  • Use the "fast-track" option of Ironic and IPA to skip the machine restart between the IPA "inspection" and "deploy" steps
  • If you are using a custom IPA, use as small an IPA image as possible
  • Use as small a node image as possible
  • Make sure you have a fast network connection between IPA and the server that stores your node image
  • Use redfish-virtualmedia boot if possible to skip the complexities and the one extra restart introduced by PXE booting

These are the things that I remembered at the moment.
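
For the cloud-init/kubeadm phase mentioned in point 2, which neither Ironic nor BMO can see, you can get a rough number from the node itself after it has joined. The snippet below is only a sketch, assuming the default /var/log/cloud-init.log location and its standard timestamped line prefix; `cloud-init analyze` gives a more detailed breakdown where it is available.

```python
# Rough duration of the cloud-init phase on the provisioned node, taken as
# the difference between the first and last timestamps in cloud-init.log.
from datetime import datetime

LOG = "/var/log/cloud-init.log"  # default location on most images

timestamps = []
with open(LOG) as f:
    for line in f:
        try:
            # Lines normally start with "YYYY-MM-DD HH:MM:SS,mmm - ..."
            timestamps.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
        except ValueError:
            continue  # line without a leading timestamp

if timestamps:
    print(f"cloud-init span: {timestamps[-1] - timestamps[0]}")
```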

/close
There has been no discussion for two months, so I will assume my answers were sufficient.

@Rozzii: Closing this issue.
