/deployment

Ansible playbooks and scripts for deploying packit-service to OpenShift


Deploying Packit service or Stream service to OpenShift

tl;dr How to deploy

  1. Obtain all the necessary secrets.
  2. Configure the deployment by creating a variable file in 'vars/' from a template as described in vars/README.
  3. dnf install ansible origin-clients python3-openshift
  4. [SERVICE={service}] DEPLOYMENT={deployment} make deploy (see vars/README and the example below).
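
For illustration, a deployment of the staging Packit instance could look like this (the SERVICE and DEPLOYMENT values here are just examples, see vars/README for the ones valid for your setup):

$ dnf install ansible origin-clients python3-openshift
$ SERVICE=packit DEPLOYMENT=stg make deploy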

By default, the playbook checks that the local copies of the deployment and secrets repositories are up to date, and the variable file used is up to date with the corresponding template.

For these checks to work, you need to set secrets_repo_url in the variable file used.

To disable all these checks, set check_up_to_date to false in the variable file.

To only disable comparing the variable file to the template, set check_vars_template_diff to false.
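
As an illustrative sketch, the relevant keys in the variable file (here vars/packit/prod.yml; the values are examples) could look like this:

# vars/packit/prod.yml
secrets_repo_url: <url of the secrets repository>
check_up_to_date: true           # set to false to disable all the checks above
check_vars_template_diff: true   # set to false to skip only the vars-vs-template comparison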

What's in here

Images

We build separate images for the individual components: the service, the worker, the fedora messaging consumer (packit-service-fedmsg) and the dashboard.

Production vs. Staging images

Separate images are built for the staging and production deployments. Staging images are tagged :stg and built from the main branches of packit-service, packit-service-fedmsg, sandcastle and dashboard. Production images are tagged :prod and built from the stable branches of the same repositories. To move the stable branches to a newer 'stable' commit we use scripts/move_stable.py

Image build process

Images are automatically built and pushed to Quay.io by a GitHub workflow whenever a new commit is pushed to the main or stable branch.

In each repo that builds images, see:

  • Actions tab
  • .github/workflows/*.yml for configuration of the Action/workflow

For more details about local builds see packit-service/CONTRIBUTING.md

Continuous Deployment

tl;dr: Newer images in the registry are automatically imported and re-deployed.

Long story: We use ImageStreams as an intermediary between the image registry (Quay.io) and a DeploymentConfig/StatefulSet. This has several significant benefits:

  • We can automatically trigger Deployment when a new image is pushed to the registry.
  • We can rollback/revert/undo the Deployment.

Image registry -> [1] -> ImageStream -> DeploymentConfig/StatefulSet

[1] This is automatic on stg, but not on prod. On prod, where we don't want the images to be imported as soon as they're built, we run a CronJob to import them (and hence re-deploy) at the day & time we want (currently 2AM on Tuesday).
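
To check which images the image streams currently point to (and when the last import happened), something like the following can be used:

$ oc get is
$ oc describe is/packit-worker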

Manually import a newer image

tl;dr: DEPLOYMENT=prod make import-images

Long story: If you need to import (and deploy) newer image(s) before the CronJob does (see above), you can do that manually:

$ oc import-image is/packit-{service|worker|dashboard|service-fedmsg}:<deployment>

once a new image has been pushed/built in the registry.

There's also an 'import-images' target in the Makefile, so DEPLOYMENT=prod make import-images does this for you for all images (image streams).

Partial deployments

To run only the tasks related to some of the services, i.e. to do a partial deployment, you can set the TAGS environment variable before calling make. For example, to run only the tasks deploying Redis and Redis Commander, run:

$ DEPLOYMENT=dev TAGS="redis,redis-commander" make deploy

Use make tags to list the currently available tags.

Reverting to older deployment/revision/image

DeploymentConfigs (i.e. service & service-fedmsg) can be reverted with oc rollout undo:

$ oc rollout undo dc/packit-service [--to-revision=X]
$ oc rollout undo dc/packit-service-fedmsg [--to-revision=X]

where X is the revision number. See also oc rollout history dc/packit-service [--revision=X].

It's trickier in the case of the StatefulSet, which we use for the workers. oc rollout undo does not seem to work with StatefulSets (even though it should). So if you happen to deploy a broken worker and want to revert/undo it because you don't yet know the cause/fix, you have to:

  1. oc describe is/packit-worker - select an older image
  2. oc tag --source=docker usercont/packit-worker@sha256:<older-hash> myproject/packit-worker:<deployment>

Then watch the packit-worker-x pods being re-deployed from the older image.

Prod re-deployment

  1. Trigger :prod image builds
  2. Import the images -> re-deploy

How do I test my changes?

docker-compose (quick & dirty)

There's a docker-compose.yml in packit-service. See Running packit-service locally for how to make that work.

oc cluster up (slow & better)

Because we run the service in OpenShift, the more reliable way to test it is to run an OpenShift cluster locally and deploy the service there. oc cluster up spawns the OpenShift cluster. Create secrets/packit/dev/ (steal them from our secrets repo). cd vars/packit; cp dev_template.yml dev.yml and in dev.yml set api_key to the output of oc whoami -t.

Run DEPLOYMENT=dev make deploy. That will also push locally built images (:dev) into the cluster's registry (make sure you have push_dev_images: true in vars/packit/dev.yml).
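
Putting the steps above together, a local dev deployment could look roughly like this (paths follow this repo's layout):

$ oc cluster up
$ cd vars/packit && cp dev_template.yml dev.yml
$ oc whoami -t    # put this token into api_key in dev.yml
$ cd ../.. && DEPLOYMENT=dev make deploy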

minishift

Similar to 'oc cluster up' above, you can run minishift to get a local OpenShift cluster. In addition to the above, you need to use docker and oc from the minishift environment after you start minishift:

$ eval $(minishift docker-env)
$ eval $(minishift oc-env)
$ oc config use-context minishift

and then build the worker & service images (make worker; make service in the packit-service repo) with Docker before running DEPLOYMENT=dev make deploy.

Staging (quick & reliable, but don't break it)

If you're lazy and you're sure your changes won't do any harm, you can temporarily take over the staging instance for that. Just build & push packit-worker and you can play.

  • in packit-service repo:
    • make worker
    • podman tag quay.io/packit/packit-worker:dev quay.io/packit/packit-worker:stg
    • podman push quay.io/packit/packit-worker:stg
  • in deployment: DEPLOYMENT=stg make import-images

Once you're done, you should revert to the older image. Otherwise it will be replaced automatically once a packit-service PR is merged.

Generating secrets for local packit-service deployment

Local deployment of Packit service needs some secrets which can be generated using the steps listed below:

  1. Create a dev directory under secrets/
  2. Create a new GitHub app or open an existing one and download the key file as secrets/dev/private-key.pem
  3. Replace the variables with your user-specific values in roles/generate-secrets/vars/main.yml
  4. Run the playbook: make generate-local-secrets

Then copy the secrets directory to your packit-service directory, e.g. as sketched below.
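
A minimal sketch of the whole flow (the key-file source and the packit-service path are just examples):

$ mkdir -p secrets/dev
$ cp ~/Downloads/<github-app>.private-key.pem secrets/dev/private-key.pem
$ $EDITOR roles/generate-secrets/vars/main.yml    # fill in your user-specific values
$ make generate-local-secrets
$ cp -r secrets ../packit-service/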

Zuul

We have to encrypt the secrets, because we are using them in Zuul CI. This repository provides a helpful playbook to do this with one command:

DEPLOYMENT=stg make zuul-secrets

How are the secrets encrypted?

Zuul provides a public key for every project. The ansible playbook downloads the Zuul repository and passes the project tenant and name as parameters to the encryption script. The script then encrypts the files with the project's public key. For more information, please refer to the official docs.

Obtaining a Let's Encrypt cert using certbot

Certbot manual: https://certbot.eff.org/docs/using.html#manual

Please bear in mind this is the easiest process I was able to figure out: there is plenty of room for improvement and ideally it should be 100% automated.

We are using multi-domain wildcard certificates for the following domains:

  • *.packit.dev
  • *.stream.packit.dev
  • *.stg.packit.dev
  • *.stg.stream.packit.dev

In case the procedure below does not work, the previously used HTTP challenge can be used instead. Be aware that the HTTP challenge approach is more complex, involves destructive actions and causes a longer downtime.

TL;DR

  1. Check prerequisites.
  2. Run certbot to obtain the challenges.
  3. Configure DNS TXT records based on values requested in 2.
  4. Update secrets repository.
  5. Re-deploy stg&prod.

Note: If certbot is executed against multiple domains, step 3. is repeated for each domain.

1. Prerequisites

Make sure the DNS is all set up:

$ dig prod.packit.dev
; <<>> DiG 9.16.20-RH <<>> prod.packit.dev
;; QUESTION SECTION:
;prod.packit.dev.		IN	A
;; ANSWER SECTION:
prod.packit.dev.	24	IN	CNAME	elb.e4ff.pro-eu-west-1.openshiftapps.com.
elb.e4ff.pro-eu-west-1.openshiftapps.com. 3150 IN CNAME	pro-eu-west-1-infra-1781350677.eu-west-1.elb.amazonaws.com.
pro-eu-west-1-infra-1781350677.eu-west-1.elb.amazonaws.com. 60 IN A 18.202.187.210
pro-eu-west-1-infra-1781350677.eu-west-1.elb.amazonaws.com. 60 IN A 54.72.5.59

Check that you have access to the packit.dev domain in Google Domains.

Check/install certbot locally: dnf install certbot.

2. Run certbot to obtain the challenges

Run certbot:

$ certbot certonly --config-dir ~/.certbot --work-dir ~/.certbot --logs-dir ~/.certbot --manual --preferred-challenges dns --email user-cont-team@redhat.com -d *.packit.dev -d *.stream.packit.dev -d *.stg.packit.dev -d *.stg.stream.packit.dev

You will be asked to set a TXT record for every domain requested:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.abcxyz.packit.dev with the following value:

123456abcdef

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

3. Update DNS record

Go to Google Domains and create/set the corresponding value: a TXT record called _acme-challenge.abcxyz.packit.dev. If those records already exist (from a previous run), don't create new ones, just edit the current ones (or first delete the old ones and then create new ones).

Wait till it's distributed: in another terminal, watch nslookup until it returns the configured value:

[~/]$ watch -d nslookup -q=TXT _acme-challenge.abcxyz.packit.dev
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
_acme-challenge.packit.dev      text = "123456abcdef"

Authoritative answers can be found from:

Then press Ctrl+C to stop the watch.

Go to the terminal where certbot is waiting for your action and hit Enter.

Repeat this for all requested domains.

4. Update secrets repository

Copy the certificates to the secrets repository (prod & stg):

cp ~/.certbot/live/packit.dev/{fullchain,privkey}.pem <cloned secrets repo>/secrets/packit/prod/
cp ~/.certbot/live/packit.dev/{fullchain,privkey}.pem <cloned secrets repo>/secrets/packit/stg/
cp ~/.certbot/live/packit.dev/{fullchain,privkey}.pem <cloned secrets repo>/secrets/stream/prod/
cp ~/.certbot/live/packit.dev/{fullchain,privkey}.pem <cloned secrets repo>/secrets/stream/stg/

Push, create a merge request and merge.

5. Re-deploy the stg and prod environments:

packit service

DEPLOYMENT=stg make deploy TAGS=secrets
DEPLOYMENT=prod make deploy TAGS=secrets

stream service

SERVICE=stream DEPLOYMENT=stg make deploy TAGS=secrets
SERVICE=stream DEPLOYMENT=prod make deploy TAGS=secrets

Restart (scale down and up) packit-service, packit-dashboard and nginx for them to use the new certs.
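
One way to do the restart from the command line; the resource types/names below are assumptions, adjust them to what oc get all shows in your project:

$ oc scale dc/packit-service --replicas=0 && oc scale dc/packit-service --replicas=1
# repeat for packit-dashboard and nginx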

How to inspect a certificate

If you want to inspect local certificates, you can use certtool (gnutls-utils package) to view the cert's metadata:
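
For example, assuming the certificate file is fullchain.pem, the following prints output like the one below:

$ certtool -i < fullchain.pem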

X.509 Certificate Information:
    Version: 3
    Serial Number (hex): 04f4864b597f9c0859260d88e04cfabfeeac
    Issuer: CN=R3,O=Let's Encrypt,C=US
    Validity:
        Not Before: Wed Feb 17 14:46:25 UTC 2021
        Not After: Tue May 18 14:46:25 UTC 2021

Repository cache

To shorten cloning times and to make it possible to clone big repositories (like the kernel), the service supports caching git repositories.

To make it work, you need to:

  • Configure the cache in the service config:
# Path of the cache (mounted as a persistent volume in our case)
repository_cache: /repository-cache
# The maintenance of the cache (adding, updating) is done externally,
# not in the packit/packit-service code.
add_repositories_to_repository_cache: false
  • Since our infrastructure does not support shared volumes, we need to attach one volume with a cache to each worker and one to each corresponding sandbox pod.
    • In the case of workers, this is done during the deployment.
    • For sandboxes, there is an option in the service config (the environment variable needs to differ for each worker and is set during startup of the worker):
command_handler_pvc_volume_specs:
  - path: /repository-cache
    pvc_from_env: SANDCASTLE_REPOSITORY_CACHE_VOLUME
  • And finally, add some repositories to the cache.
    • To add a repository, clone/copy the git repository to the /repository-cache directory. (Each project as a subdirectory.)
    • Be aware that all the volumes, both those used by workers and those used by sandbox pods, need to have the repository present.
    • For small projects, you can clone the repository from the pod's shell.
    • In the case of larger repositories (like the kernel), you can clone the repository on your localhost and use oc rsync to copy it to the volume (see the sketch after this list).
    • For the worker's volume, you can oc rsh to the worker pod to get to the volume.
    • To populate the volumes attached to sandbox pods, you can wait for some action to be executed, or create a new pod with the volume attached (e.g. oc create -f ./openshift/repository-cache-filler.yml, adjusting the name of the volume in that definition). This blocks the creation of the sandbox pods because the volume can't be mounted by multiple pods, so don't forget to delete the pod after you have finished populating the cache.
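
A rough sketch of populating a cache volume with a large repository via oc rsync (the repository URL and the pod name are illustrative):

$ git clone <url-of-the-large-repo> kernel
$ oc rsync ./kernel packit-worker-0:/repository-cache/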

Monitoring of packit-service

Pushgateway

To record metrics from Celery tasks we use Prometheus Pushgateway, which is deployed in our cluster. It collects the metrics from the workers and provides the /metrics endpoint for Prometheus. There is a Prometheus instance running in OpenShift PSI which is going to scrape the /metrics endpoint, making it possible to visualize the metrics. Therefore the /metrics endpoint needs to be publicly accessible: it is exposed on https://workers.packit.dev/metrics for the production instance and on https://workers.stg.packit.dev/metrics for the staging instance. We use nginx (definition) as a reverse proxy for the pushgateway, which enables us to allow only GET requests and forward these to the pushgateway (workers can send POST requests internally).
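
Since only GET requests are allowed from outside, a quick sanity check of what the workers currently report can be done with curl:

$ curl -s https://workers.stg.packit.dev/metrics | head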

PostgreSQL data migration

To write out the data from the database, the pg_dumpall command can be used (or pg_dump packit to dump only the packit database). The command creates a file with SQL commands for restoring the database. The only impact of running pg_dump/pg_dumpall should be the increased I/O load and the long-running transaction it creates. To import the data, the psql command can be used.

Upgrade

When upgrading the database between major versions, the data can be incompatible with the new version.

We run Postgres in an OpenShift pod, so the process to migrate the data can be to create a new pod (it is important to also use a new PVC in this pod), then dump the data from the old pod and import it into the new pod:

$ oc exec old-postgres-pod -- pg_dumpall -U postgres > dump
$ oc exec -it new-postgres-pod -- psql -U postgres < dump

The postgres service then needs to be re-pointed to the new pod; afterwards the old pod and its PVC can be deleted.
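
A sketch of that final step, assuming the service selects pods by a name label (the label and the pod/PVC names here are hypothetical):

$ oc set selector svc/postgres name=new-postgres-pod
$ oc delete pod old-postgres-pod
$ oc delete pvc old-postgres-pvc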