equinix-labs/terraform-equinix-metal-eks-anywhere

Issues and notes for large lab provisioning

Address, document, or create new issues for the following large lab deployment observations:

  • SSH provisioning for create_cluster should be idempotent. If tainted, this resource should be safe to rerun and should succeed, which means clearing out any existing kind bootstrap cluster and other partial cluster creation state before recreating (see the create_cluster sketch after this list).
  • Cloud-init success should be a server creation requirement. Depends on equinix/terraform-provider-equinix#214. Toggling the network state to hybrid could be delayed until cloud-init has finished, which would remove the need for the wait-for-cloud-init provisioner (see the network-type sketch after this list).
  • Contingent on the previous item: move as much SSH provisioning as possible into cloud-init. Note: create-cluster relies on Layer 2 being ready, so we would need a way to wait until it is up before starting the create-cluster step.
  • DP and CP nodes should not request public IP addresses; this reduces the risk of a device provision failing due to IP limits. We don't use dynamically allocated public IPs for these nodes; we use gateway IP reservations (see the ip_address sketch after this list).
  • Metal Gateway deletion can error on 404s (equinix/terraform-provider-equinix#257).
  • #49
  • Reduce the "Create Cluster" provision timeout to ~20m. Waiting 60m for a failure does not yield better results.
  • Provisioning is lock-step: error and success states are not surfaced until all nodes are ready to move on to the next step. Find ways to increase parallelization and independence.
  • Document that local socket constraints (SSH, SSH agent) can cause large provisions to fail. Offer guidance on how many concurrent connections are possible and how to raise the limit (for example, lowering terraform apply -parallelism reduces the number of simultaneous SSH sessions).
  • On destroy, Metal VLANs may fail to delete while devices are being deprovisioned
  • On destroy, Projects may fail to delete when devices are pending deletion
  • Outputs should be provided in project-collaborator and lab projects
  • A check.sh script should be created that SSHes into each eksa-admin node and verifies the cluster node count; the eksa-add-on node should also be verified to be healthy to the extent possible.
  • The Metal CLI should be installed on eksa-admin with the project API key preconfigured in ~/.config/equinix/metal.yaml (see the cloud-init sketch after this list).
  • AWS CLI should be preinstalled
  • #46
  • Include more details in README.md about the existing nodes (the IP addresses of all four nodes), although kubectl get nodes -o yaml offers the same information.
  • Need additional verified plans #33
  • The eksa-admin node plan should be configurable in the lab (it does not need to match the cp and dp types and could use a smaller, more readily available plan; see the plan variable sketch after this list).
  • When noting which plans are verified, also note where configurations may conflict (for example, m3.small.x86 works, but not in FR, where the NIC identifier differs) (related to #56).
  • In the README.md "failures" section, document more of the common failure scenarios and the best way to handle them in large deployments (when the replace script should be used, when a secondary lab workspace should be created, and when state should be removed).
  • Include a state removal helper #54
  • replace.sh should accept a second argument to filter the replacement to admin, cp, dp, or addon nodes. These specific values don't have to be hardcoded; the second parameter could be a secondary module grep pattern.
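
For the create_cluster idempotency item above, a minimal sketch of what a re-runnable provisioner could look like, assuming cluster creation is driven by a null_resource remote-exec provisioner. The resource name, admin host reference, cluster name, and cleanup paths are illustrative, not the module's actual identifiers.

```hcl
resource "null_resource" "create_cluster" {
  connection {
    type        = "ssh"
    host        = local.eksa_admin_ip            # illustrative
    user        = "root"
    private_key = file(var.ssh_private_key_path) # illustrative
  }

  provisioner "remote-exec" {
    inline = [
      # Clear any bootstrap kind cluster and partial state left behind by a
      # previous failed run so a tainted resource can simply be re-applied.
      "kind delete clusters --all || true",
      "rm -rf /root/eksa-cluster || true", # generated cluster folder; path is illustrative
      "eksctl anywhere create cluster -f /root/cluster.yaml",
    ]
  }
}
```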
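
For the cloud-init ordering item, a sketch that only flips a node to hybrid networking once cloud-init is known to have finished. It assumes either equinix/terraform-provider-equinix#214 lands or the existing wait-for-cloud-init step is kept as its own resource; the resource names are illustrative.

```hcl
resource "equinix_metal_device_network_type" "eksa_node" {
  device_id = equinix_metal_device.eksa_node.id
  type      = "hybrid"

  # Hypothetical gate: do not convert to hybrid until cloud-init has reported
  # success, so the Layer 2 change cannot race the cloud-init run.
  depends_on = [null_resource.wait_for_cloud_init]
}
```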
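
For the public IP item, a sketch of a CP/DP device that requests only a private IPv4 address via the equinix_metal_device ip_address block, so no dynamic public IPs count against project limits. Variable and resource names are illustrative.

```hcl
resource "equinix_metal_device" "cp_node" {
  hostname         = "eksa-cp-1"
  plan             = var.cp_plan
  metro            = var.metro
  operating_system = "ubuntu_20_04"
  project_id       = var.project_id

  # Request only a private IPv4 address; the gateway IP reservations provide
  # any externally reachable addressing these nodes need.
  ip_address {
    type = "private_ipv4"
  }
}
```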
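
For the configurable eksa-admin item, a sketch of a dedicated plan variable (name and default are illustrative) so the admin node can use a smaller, more available plan than the cp/dp nodes.

```hcl
variable "eksa_admin_plan" {
  description = "Server plan for the eksa-admin node; may be smaller than the cp/dp plans."
  type        = string
  default     = "m3.small.x86"
}
```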
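
For the Metal CLI item, a sketch that drops a preconfigured config file for root via cloud-init user_data (installation of the CLI itself is omitted). The var.project_api_key variable is illustrative, and the token key name in metal.yaml is an assumption that should be confirmed against the metal CLI documentation.

```hcl
resource "equinix_metal_device" "eksa_admin" {
  hostname         = "eksa-admin"
  plan             = var.eksa_admin_plan
  metro            = var.metro
  operating_system = "ubuntu_20_04"
  project_id       = var.project_id

  user_data = <<-EOF
    #cloud-config
    write_files:
      # "token" is the assumed metal CLI config key; confirm against the CLI docs.
      - path: /root/.config/equinix/metal.yaml
        permissions: "0600"
        content: |
          token: ${var.project_api_key}
  EOF
}
```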