# OpenShift-PSAP CI Artifacts

This repository contains Ansible roles and playbooks for the OpenShift PSAP (Performance & Latency Sensitive Application Platform) CI.
## Quickstart
Requirements (on localhost):

- Ansible >= 2.9.5
- OpenShift Client (`oc`)
- A kubeconfig file defined at `KUBECONFIG`
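A quick way to check these requirements locally (a minimal sanity check, not part of the repository):

```
ansible --version    # should report 2.9.5 or newer
oc version --client  # confirms that the OpenShift client is installed
echo "${KUBECONFIG:?KUBECONFIG must point to a kubeconfig file}"
oc get nodes         # sanity-check that the kubeconfig gives access to the cluster
```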
## CI testing of the GPU Operator
The main goal of this repository is to perform nightly testing of the GPU Operator. This consists of multiple pieces:
- a container image definition;
- an entrypoint script that will run in the container image;
- a set of config files and associated jobs for the Prow CI engine.
See there for the nightly CI results.
As an example, the nightly tests currently run commands such as:
```
run gpu-operator_test-operatorhub   # test the GPU Operator from OperatorHub installation
run gpu-operator_test-master-branch # test the GPU Operator from its `master` branch
run gpu-operator_test-helm 1.4.0    # test the GPU Operator from Helm installation
```
These commands will in turn trigger `toolbox` commands to prepare the cluster, install the relevant operators, and validate the successful usage of the GPUs.

The `toolbox` commands are described in the section below.
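For illustration, a nightly OperatorHub test roughly chains `toolbox` commands like the following (a sketch only; the container entrypoint script is the source of truth for the actual sequence):

```
./toolbox/cluster/scaleup.sh                        # ensure a GPU node is available
toolbox/entitlement/deploy.sh --pem /path/to/pem    # entitle the cluster
toolbox/entitlement/wait.sh
toolbox/nfd/deploy_from_operatorhub.sh              # deploy NFD to label the GPU nodes
toolbox/nfd/wait_gpu_nodes.sh
toolbox/gpu-operator/deploy_from_operatorhub.sh     # install the GPU Operator
toolbox/gpu-operator/wait_deployment.sh
toolbox/gpu-operator/run_gpu_burn.sh 30             # validate GPU usage with GPU-burn
```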
## GPU Operator toolbox
See the progress and discussions about the toolbox development in this issue.
### GPU Operator
- Deploy from OperatorHub
  - allow deploying an older version (openshift-psap#76)

  ```
  toolbox/gpu-operator/deploy_from_operatorhub.sh [<version>]
  toolbox/gpu-operator/undeploy_from_operatorhub.sh
  ```
- [x] List the versions available from OperatorHub (not 100% reliable, the connection may time out)

  ```
  toolbox/gpu-operator/list_version_from_operator_hub.sh
  ```

  Usage:

  ```
  toolbox/gpu-operator/list_version_from_operator_hub.sh [<package-name> [<catalog-name>]]
  toolbox/gpu-operator/list_version_from_operator_hub.sh --help
  ```

  Defaults:

  - package-name: `gpu-operator-certified`
  - catalog-name: `certified-operators`
  - namespace: `openshift-marketplace` (controlled with the `NAMESPACE` environment variable)
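  For instance, to query the defaults explicitly (an illustrative invocation; the values are just the defaults listed above):

  ```
  NAMESPACE=openshift-marketplace \
    toolbox/gpu-operator/list_version_from_operator_hub.sh gpu-operator-certified certified-operators
  ```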
- Deploy from Helm

  ```
  toolbox/gpu-operator/list_version_from_helm.sh
  toolbox/gpu-operator/deploy_from_helm.sh <helm-version>
  toolbox/gpu-operator/undeploy_from_helm.sh
  ```
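  For example, list the available chart versions and deploy one of them (a sketch; `1.4.0` is only the version used in the nightly test example above):

  ```
  toolbox/gpu-operator/list_version_from_helm.sh
  toolbox/gpu-operator/deploy_from_helm.sh 1.4.0
  ```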
- Deploy from a custom commit

  ```
  toolbox/gpu-operator/deploy_from_commit.sh <git repository> <git reference> [gpu_operator_image_tag_uid]
  ```

  Example:

  ```
  toolbox/gpu-operator/deploy_from_commit.sh https://github.com/NVIDIA/gpu-operator.git master
  ```
- Wait for the GPU Operator deployment and validate it

  ```
  toolbox/gpu-operator/wait_deployment.sh
  ```
- Run GPU-burn to validate that all the GPUs of all the nodes can run workloads

  ```
  toolbox/gpu-operator/run_gpu_burn.sh [gpu-burn runtime, in seconds]
  ```
- Capture possible GPU Operator issues (entitlement, NFD labelling, operator deployment, state of the resources in `gpu-operator-resources`, ...)

  ```
  toolbox/entitlement/test.sh
  toolbox/nfd/has_nfd_labels.sh
  toolbox/nfd/has_gpu_nodes.sh
  toolbox/gpu-operator/wait_deployment.sh
  toolbox/gpu-operator/run_gpu_burn.sh 30
  toolbox/gpu-operator/capture_deployment_state.sh
  ```

  or all in one step:

  ```
  toolbox/gpu-operator/diagnose.sh
  ```
- Uninstall and cleanup stalled resources

  `helm` (in particular) fails to deploy when any resource is left over from a previously failed deployment, e.g.:

  ```
  Error: rendered manifests contain a resource that already exists. Unable to continue with install: existing resource conflict: namespace: , name: gpu-operator, existing_kind: rbac.authorization.k8s.io/v1, Kind=ClusterRole, new_kind: rbac.authorization.k8s.io/v1, Kind=ClusterRole
  ```

  This command ensures that the GPU Operator is fully undeployed from the cluster:

  ```
  toolbox/gpu-operator/cleanup_resources.sh
  ```
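  A typical recovery after such a failure could look like this (a sketch; the Helm chart version is illustrative):

  ```
  toolbox/gpu-operator/cleanup_resources.sh   # remove the leftovers of the failed deployment
  toolbox/gpu-operator/deploy_from_helm.sh 1.4.0
  ```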
### NFD
- Deploy the NFD operator from OperatorHub:

  ```
  toolbox/nfd/deploy_from_operatorhub.sh
  toolbox/nfd/undeploy_from_operatorhub.sh
  ```

  - Control the channel to use from the command-line
- Test the NFD deployment
  - test with the NFD if GPU nodes are available
  - wait with the NFD for GPU nodes to become available

  ```
  toolbox/nfd/has_nfd_labels.sh
  toolbox/nfd/has_gpu_nodes.sh
  toolbox/nfd/wait_gpu_nodes.sh
  ```
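  Put together, a minimal NFD check before installing the GPU Operator could be (a sketch):

  ```
  toolbox/nfd/deploy_from_operatorhub.sh   # deploy the NFD operator
  toolbox/nfd/wait_gpu_nodes.sh            # wait until a GPU node is labelled
  ```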
### Cluster
- Add a GPU node on AWS

  ```
  ./toolbox/cluster/scaleup.sh
  ```

  Specify a machine type in the command-line, and skip the scale-up if a node with the given machine type is already present:

  ```
  ./toolbox/cluster/scaleup.sh <machine-type>
  ```
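  For instance (`g4dn.xlarge` is only an example of an AWS GPU instance type, not a default of the script):

  ```
  ./toolbox/cluster/scaleup.sh g4dn.xlarge
  ```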
- Entitle the cluster by passing a PEM file, checking if it should be concatenated or not, etc., and do nothing if the cluster is already entitled

  ```
  toolbox/entitlement/deploy.sh --pem /path/to/pem
  toolbox/entitlement/deploy.sh --machine-configs /path/to/machineconfigs
  toolbox/entitlement/undeploy.sh
  toolbox/entitlement/test.sh [--no-inspect]
  toolbox/entitlement/wait.sh
  ```
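  A typical entitlement flow could be (a sketch; the PEM path is a placeholder):

  ```
  toolbox/entitlement/deploy.sh --pem /path/to/key.pem   # deploy the entitlement
  toolbox/entitlement/wait.sh                            # wait for it to become active
  toolbox/entitlement/test.sh                            # verify the entitlement
  ```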
- Capture all the clues required to understand entitlement issues

  ```
  toolbox/entitlement/inspect.sh
  ```
- Deployment of an entitled cluster
  - already coded, but we need to integrate this repo within the toolbox
  - deploy a cluster with 1 master node
### CI
- Build the image used for the Prow CI testing, and run a given command in the Pod

  Usage:

  ```
  toolbox/local-ci/deploy.sh <ci command> <git repository> <git reference> [gpu_operator_image_tag_uid]
  ```

  Example:

  ```
  toolbox/local-ci/deploy.sh 'run gpu-ci' https://github.com/openshift-psap/ci-artifacts.git master
  ```

  Cleanup:

  ```
  toolbox/local-ci/cleanup.sh
  ```