A Tekton pipeline that builds an image from a Dockerfile, pushes the image to a Quay registry, then pulls that image and runs the CUDA code on a GPU. The CUDA code used here is the matrix multiplication example written by Diane Feddema from the AICoE at Red Hat and modified by Selbi Nuryyeva to work with TensorFlow 2.x. Parts of the pipeline YAML files were adapted from the AICoE.
The pipeline consists of two tasks: mm-buildah and mm-run. The mm-buildah task consists of two steps, build and push, while mm-run consists of only one step: run.
When the pipeline runs, it first creates one pod for the mm-buildah task, in which two containers are started: one for the build step and another for the push step. After the mm-buildah task completes, another pod is created for the mm-run task, with a run container. Note that each task's pod also contains other containers; these handle pulling images and other background processes.
- OpenShift Container Platform (tested only on 3.11, but it should also work on 4.x)
  - Free 1-hour access is also available through learn.openshift.com
- Tekton
  - OpenShift Pipelines
- Quay repository account and robot access
Create an account on quay.io. Once the account is set up, click "Create New Repository" at the top right and create a "Container Image Repository" named "matmul". Set it to public, choose an empty repository, and click "Create Public Repository".
Now let's set up the robot account that will allow access to the repositories. At the top right, click your username, then "Account Settings". On the left, click the robot icon, then click "Create Robot Account" on the right. Enter "build" as the name, add a description if desired, and click "Create Robot Account". Choose all repositories the robot needs to access. In this case that is only "matmul", whose permission needs to be set to "Write". Then click "Add permissions". The robot can now push to and pull from your repository.
Log in to OpenShift:
oc login -u admin
It will prompt for your password. Then create a new project called "matmul":
oc new-project matmul
Fork this repository to your own GitHub account, then clone it onto the local machine from which the pipelines will be run. Remember to replace username with your own:
git clone https://github.com/username/matmul-gpu.git
Create a service account called "matmul":
oc create sa matmul
Grant the service account the privileges it needs to build, push, and pull images:
oc adm policy add-scc-to-user privileged -z matmul
oc adm policy add-scc-to-user anyuid -z matmul
The -z flag refers specifically to a service account.
Now let's set up Quay registry access for the service account. In Quay, click your username --> Account Settings --> the robot icon on the left. Click the name of the robot account set up earlier and go to "Kubernetes Secret". The secret can either be downloaded to the local machine that will run the pipelines, or viewed and copy-pasted into a file on that machine, e.g. with vi secret-file.yml.
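The secret that Quay generates is a standard pull secret of type kubernetes.io/dockerconfigjson. The sketch below shows roughly what its .dockerconfigjson value contains: the robot name and token, joined by a colon and base64-encoded under the registry host. It assumes the standard Docker config layout. The credentials (username+build, EXAMPLE_TOKEN) and both function names are hypothetical; the real values come from the file downloaded from Quay.

```python
import base64
import json


def dockerconfigjson(registry: str, username: str, token: str) -> str:
    """Build a Docker config JSON document holding one registry login."""
    # The "auth" field is base64("<username>:<token>").
    auth = base64.b64encode(f"{username}:{token}".encode()).decode()
    return json.dumps({"auths": {registry: {"auth": auth}}})


def secret_data_value(registry: str, username: str, token: str) -> str:
    """Return the value stored under data/.dockerconfigjson in the Secret."""
    # Kubernetes Secret data values are base64-encoded again on top of the JSON.
    config = dockerconfigjson(registry, username, token)
    return base64.b64encode(config.encode()).decode()


if __name__ == "__main__":
    # Hypothetical robot credentials, for illustration only.
    print(dockerconfigjson("quay.io", "username+build", "EXAMPLE_TOKEN"))
```

This double encoding is why the value in the downloaded YAML looks opaque. Decoding it twice should reveal the robot account name.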
Before applying the secret, change its name: field in the YAML file to "matmul-secret". Now apply the secret (if you downloaded the file, use its filename):
oc apply -f secret-file.yml
The -f flag refers to the filename.
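The rename can also be done without an editor. This minimal sketch assumes that the first indented name: line in the downloaded file is metadata.name, which holds for a Secret that has no other nested name fields. rename_secret is a hypothetical helper, not part of the repository.

```python
import re

# Matches the first indented "name:" line, assumed to be metadata.name.
NAME_LINE = re.compile(r"^([ \t]+name:[ \t]*).*$", re.MULTILINE)


def rename_secret(manifest: str, new_name: str = "matmul-secret") -> str:
    """Return `manifest` with its first indented name: value set to `new_name`."""
    renamed, count = NAME_LINE.subn(lambda m: m.group(1) + new_name, manifest, count=1)
    if count == 0:
        raise ValueError("no indented name: line found")
    return renamed
```

For example, read secret-file.yml into a string, pass it through rename_secret, write the result back, and then run oc apply -f secret-file.yml as before.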
The creation of the secret can be confirmed by running the command below and checking for its name:
oc get secret
Now the service account needs to be told about the secret:
oc edit sa matmul
and add the same line, - name: matmul-secret, under both imagePullSecrets and secrets, as shown below:
imagePullSecrets:
- name: matmul-secret
- name: ...
...
secrets:
- name: matmul-secret
- name: ...
Then save and exit (i.e. press ESC and type :wq).
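The same change can be scripted instead of made in vi. This minimal sketch assumes the service account is exported as JSON (for example with oc get sa matmul -o json) and that the modified JSON is fed back with oc replace -f -. link_secret is a hypothetical helper, and the sample token secret name is made up. The oc client also provides oc secrets link matmul matmul-secret --for=pull,mount, which should achieve the same result.

```python
import json
from typing import Any, Dict


def link_secret(sa: Dict[str, Any], secret: str = "matmul-secret") -> Dict[str, Any]:
    """Add `secret` to both imagePullSecrets and secrets of a ServiceAccount.

    Existing entries are kept, and the secret is never added twice.
    """
    for field in ("imagePullSecrets", "secrets"):
        entries = sa.setdefault(field, [])
        if not any(entry.get("name") == secret for entry in entries):
            entries.append({"name": secret})
    return sa


if __name__ == "__main__":
    # Hypothetical ServiceAccount, shaped like `oc get sa matmul -o json` output.
    sa = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": "matmul"},
        "secrets": [{"name": "matmul-token-abcde"}],
    }
    print(json.dumps(link_secret(sa), indent=2))
```

The duplicate check makes the helper safe to run more than once on the same service account.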
Now it is good to go!
Go to the cloned folder that contains the "full-pipeline.yml" file. First, create all the pipeline resources, the tasks, the pipeline, and the persistent volume claim:
oc apply -f full-pipeline.yml
which should produce the following output:
pipelineresource.tekton.dev/mm-repo created
pipelineresource.tekton.dev/mm-build-image created
persistentvolumeclaim/mm-runtime-pvc created
task.tekton.dev/mm-buildah created
task.tekton.dev/mm-run created
pipeline.tekton.dev/matmul-pl created
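When this step is repeated, the output can be checked programmatically. This minimal sketch assumes the output of oc apply -f full-pipeline.yml has been captured as text, for example by redirecting it to a file. EXPECTED and missing_resources are hypothetical names. The expected resources are taken from the output above.

```python
from typing import Set

# Resources that full-pipeline.yml is expected to create (from the output above).
EXPECTED: Set[str] = {
    "pipelineresource.tekton.dev/mm-repo",
    "pipelineresource.tekton.dev/mm-build-image",
    "persistentvolumeclaim/mm-runtime-pvc",
    "task.tekton.dev/mm-buildah",
    "task.tekton.dev/mm-run",
    "pipeline.tekton.dev/matmul-pl",
}

# Verbs `oc apply` prints for resources that were created or already exist.
OK_VERBS = {"created", "configured", "unchanged"}


def missing_resources(apply_output: str) -> Set[str]:
    """Return expected resources that do not appear in the `oc apply` output."""
    seen = set()
    for line in apply_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] in OK_VERBS:
            seen.add(parts[0])
    return EXPECTED - seen
```

"configured" and "unchanged" are accepted as well as "created", because re-applying the file on later runs reports existing resources that way.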
And now let's start the pipeline:
oc apply -f pipeline-run.yml
Now the pipeline is running. It can be confirmed with:
oc get pr
As mentioned earlier, the pipeline consists of two tasks. The first task has two steps, build and push; the second task has only the run step.
To see progress, check the logs of the individual steps (each task runs in a separate pod, and each step in a separate container).
First, check the pod name:
oc get pods
which will give something similar to:
NAME                               READY   STATUS    RESTARTS   AGE
matmul-pr-build-g7gwd-pod-47196b   3/5     Running   0          2m2s
The status will change from Init:0/4 to PodInitializing and then to Running. Once the pod is running, each step has to be monitored separately. Copy the pod name into the following command:
oc logs -f matmul-pr-build-g7gwd-pod-47196b -c step-build
The logs should then appear. Remember to replace the pod name with the one generated on your machine. The logs can also be written to a file by appending > build-progress.log to the command.
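Picking the pod name and building the logs command can also be scripted. This is a minimal sketch, assuming the text of oc get pods is available (for example from subprocess.run(["oc", "get", "pods"], capture_output=True, text=True).stdout). The prefix matmul-pr-build comes from the example output above. The run pod's prefix is not shown here and would be an assumption. find_pod, PodRow, and logs_command are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class PodRow(NamedTuple):
    name: str
    ready: int   # containers ready, e.g. 3 in "3/5"
    total: int   # containers in the pod, e.g. 5 in "3/5"
    status: str  # e.g. Init:0/4, PodInitializing, Running


def find_pod(get_pods_output: str, prefix: str) -> Optional[PodRow]:
    """Return the first pod in `oc get pods` output whose name starts with `prefix`."""
    for line in get_pods_output.splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) >= 3 and cols[0].startswith(prefix):
            ready, total = (int(x) for x in cols[1].split("/"))
            return PodRow(cols[0], ready, total, cols[2])
    return None


def logs_command(pod: str, step: str) -> List[str]:
    """Build the `oc logs` command that follows one step container."""
    return ["oc", "logs", "-f", pod, "-c", f"step-{step}"]
```

The returned list can be passed directly to subprocess.run. The READY column shows how many of the pod's containers are up, which includes the helper containers mentioned earlier.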
Note that matmul.py requires TensorFlow 2.x (which the Dockerfile installs), and TensorFlow 2.x runs on CUDA 10.1. To get a working TensorFlow 2.x on CUDA 10.2, it has to be installed manually (see tensorflow/tensorflow#38194).
The push and run steps can be checked in the same way. Remember that the run step runs in a different pod, with a different pod name.
Once the pipeline run is complete, check the logs of the run step. At the end, they should look similar to this:
Shape: (1500, 1500) Device: /gpu:0
Time taken: 0:00:02.346886
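When collecting benchmark results across runs, these summary lines can be parsed automatically. This minimal sketch assumes the log contains the two lines shown above in that exact format. parse_run_log, rough_gflops, and RunResult are hypothetical names. The throughput estimate assumes the timed region is a single n x n by n x n matrix product (about 2·n³ floating-point operations). What matmul.py actually times is not stated here, so treat that number as a rough indication only.

```python
import re
from datetime import timedelta
from typing import NamedTuple

SHAPE_RE = re.compile(r"Shape: \((\d+), (\d+)\)\s+Device: (\S+)")
TIME_RE = re.compile(r"Time taken: (\d+):(\d{2}):(\d{2}(?:\.\d+)?)")


class RunResult(NamedTuple):
    rows: int
    cols: int
    device: str
    seconds: float


def parse_run_log(log: str) -> RunResult:
    """Extract shape, device and elapsed time from the run step's log."""
    shape = SHAPE_RE.search(log)
    taken = TIME_RE.search(log)
    if shape is None or taken is None:
        raise ValueError("run log does not contain the expected summary lines")
    hours, minutes, secs = taken.groups()
    elapsed = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(secs))
    return RunResult(int(shape.group(1)), int(shape.group(2)), shape.group(3),
                     elapsed.total_seconds())


def rough_gflops(n: int, seconds: float) -> float:
    """Approximate GFLOP/s for one n x n by n x n matrix product (2 * n**3 flops)."""
    return 2 * n ** 3 / seconds / 1e9
```

For the example output above, this gives 1500 x 1500 on /gpu:0 in about 2.35 s, or roughly 2.9 GFLOP/s under the single-product assumption.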
The pipeline run is complete! All tasks, pipeline resources, the pipeline, the pipeline run, and the PVC can then be deleted if they are no longer needed, e.g.:
oc delete tasks --all -n matmul
To run further benchmarks, change the size of the matrix by editing the number in the following line of full-pipeline.yml, under the mm-run task:
command: ["python3", "matmul.py", "gpu", "1500"]
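For a series of benchmarks, the edit can be scripted. This minimal sketch assumes the command line appears exactly once in full-pipeline.yml, in the form shown above. set_matrix_size and COMMAND_RE are hypothetical names.

```python
import re

# Matches the mm-run command line shown above and captures the matrix size.
COMMAND_RE = re.compile(r'(\["python3",\s*"matmul\.py",\s*"gpu",\s*")(\d+)("\])')


def set_matrix_size(pipeline_yaml: str, size: int) -> str:
    """Return the pipeline YAML text with the matmul.py matrix size replaced."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    updated, count = COMMAND_RE.subn(
        lambda m: f"{m.group(1)}{size}{m.group(3)}", pipeline_yaml)
    if count != 1:
        raise ValueError(f"expected one matmul.py command line, found {count}")
    return updated
```

For example, read full-pipeline.yml, write back set_matrix_size(text, 3000), and re-apply the file so that the next run uses the updated mm-run task. The count check fails loudly if the line has been reformatted and no longer matches.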