Welcome to the EKS Progressive Delivery hands-on featuring Flux v2, Flagger and AWS App Mesh.
Install eksctl, yq and the Flux CLI:
brew install eksctl yq fluxcd/tap/flux
In order to follow the guide you'll need a GitHub account and a
personal access token
that can create repositories (check all permissions under repo
).
Fork this repository on your personal GitHub account and export your access token, username and repo:
export GITHUB_TOKEN=<your-token>
export GITHUB_USER=<your-username>
export GITHUB_REPO=gitops-appmesh
Clone the repository on your local machine:
git clone https://github.com/${GITHUB_USER}/${GITHUB_REPO}.git
cd ${GITHUB_REPO}
Create a cluster with eksctl:
eksctl create cluster -f .eksctl/config.yaml
The above command with create a Kubernetes cluster v1.18 with two m5.large
nodes in the us-west-2 region.
Verify that your EKS cluster satisfies the prerequisites with:
$ flux check --pre
► checking prerequisites
✔ kubectl 1.19.4 >=1.18.0
✔ Kubernetes 1.18.9-eks-d1db3c >=1.16.0
✔ prerequisites checks passed
Install Flux on your cluster with:
flux bootstrap github \
--owner=${GITHUB_USER} \
--repository=${GITHUB_REPO} \
--branch=main \
--personal \
--path=clusters/appmesh
The bootstrap command commits the manifests for the Flux components in clusters/appmesh/flux-system
dir
and creates a deploy key with read-only access on GitHub, so it can pull changes inside the cluster.
Wait for the cluster reconciliation to finish:
$ watch flux get kustomizations
NAME REVISION READY
apps main/582872832315ffca8cf24232b0f6bcb942131a1f True
cluster-addons main/582872832315ffca8cf24232b0f6bcb942131a1f True
flux-system main/582872832315ffca8cf24232b0f6bcb942131a1f True
mesh main/582872832315ffca8cf24232b0f6bcb942131a1f True
mesh-addons main/582872832315ffca8cf24232b0f6bcb942131a1f True
Verify that Flagger, Prometheus, AppMesh controller and gateway Helm releases have been installed:
$ flux get helmreleases --all-namespaces
NAMESPACE NAME REVISION READY
appmesh-gateway appmesh-gateway 0.1.5 True
appmesh-system appmesh-controller 1.2.0 True
appmesh-system appmesh-prometheus 1.0.0 True
appmesh-system flagger 1.2.0 True
kube-system metrics-server 5.0.1 True
To experiment with progressive delivery, you'll be using a small Go application called podinfo. The demo app is exposed outside the cluster with AppMesh Gateway. The communication between the gateway and podinfo is managed by Flagger and AppMesh.
The application manifests are comprised of a Kubernetes deployment, a horizontal pod autoscaler, a gateway route (AppMesh custom resource) and release polices (Flagger custom resources).
./apps/podinfo/
├── abtest.yaml
├── canary.yaml
├── deployment.yaml
├── gateway-route.yaml
├── hpa.yaml
└── kustomization.yaml
Based on the release policy, Flagger configures the mesh and bootstraps the application inside the cluster.
Wait for Flagger to initialize the canary:
$ watch kubectl -n apps get canary
NAME STATUS WEIGHT LASTTRANSITIONTIME
podinfo Initialized 0 2020-11-14T12:03:39Z
Find the AppMesh Gateway public address with:
export URL="http://$(kubectl -n appmesh-gateway get svc/appmesh-gateway -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
echo $URL
Wait for the DNS to propagate and podinfo to become accessible:
$ watch curl -s ${URL}
{
"hostname": "podinfo-primary-5cf44b9799-lgq79",
"version": "5.0.0"
}
When the URL becomes available, open it in a browser and you'll see the podinfo UI.
When you deploy a new podinfo version, Flagger gradually shifts traffic to the canary, and at the same time, measures the requests success rate as well as the average response duration. Based on an analysis of these App Mesh provided metrics, a canary deployment is either promoted or rolled back.
The canary analysis is defined in apps/podinfo/canary.yaml:
analysis:
# max traffic percentage routed to canary
maxWeight: 50
# canary increment step
stepWeight: 5
# time to wait between traffic increments
interval: 15s
# max number of failed metric checks before rollback
threshold: 5
# AppMesh Prometheus checks
metrics:
- name: request-success-rate
# minimum req success rate percentage (non 5xx)
thresholdRange:
min: 99
interval: 1m
- name: request-duration
# maximum req duration in milliseconds
thresholdRange:
max: 500
interval: 1m
Pull the changes from GitHub:
git pull origin main
Bump podinfo version from 5.0.0
to 5.0.1
:
yq w -i ./apps/podinfo/kustomization.yaml 'images[0].newTag' 5.0.1
Commit and push changes:
git add -A && \
git commit -m "podinfo 5.0.1" && \
git push origin main
Tell Flux to pull the changes or wait one minute for Flux to detect the changes:
flux reconcile source git flux-system
Wait for the cluster reconciliation to finish:
watch flux get kustomizations
When Flagger detects that the deployment revision changed, it will start a new rollout. You can monitor the traffic shifting with:
watch kubectl -n apps get canary
Watch Flagger logs:
$ kubectl -n appmesh-system logs deployment/flagger -f | jq .msg
New revision detected! Scaling up podinfo.apps
Starting canary analysis for podinfo.apps
Pre-rollout check acceptance-test passed
Advance podinfo.apps canary weight 5
Advance podinfo.apps canary weight 10
Advance podinfo.apps canary weight 15
Advance podinfo.apps canary weight 20
Advance podinfo.apps canary weight 25
Advance podinfo.apps canary weight 30
Advance podinfo.apps canary weight 35
Advance podinfo.apps canary weight 40
Advance podinfo.apps canary weight 45
Advance podinfo.apps canary weight 50
Copying podinfo.apps template spec to podinfo-primary.apps
Routing all traffic to primary
Promotion completed! Scaling down podinfo.apps
Lastly, open up podinfo in the browser. You'll see that as Flagger shifts more traffic to the canary according to the policy in the Canary object, we see requests going to our new version of the app.
Besides weighted routing, Flagger can be configured to route traffic to the canary based on HTTP match conditions. In an A/B testing scenario, you'll be using HTTP headers or cookies to target a certain segment of your users. This is particularly useful for frontend applications that require session affinity.
Enable A/B testing:
yq w -i ./apps/podinfo/kustomization.yaml 'resources[0]' abtest.yaml
The above configuration will run a canary analysis targeting users based on their browser user-agent.
The A/B test routing is defined in apps/podinfo/abtest.yaml:
analysis:
# number of iterations
iterations: 10
# time to wait between iterations
interval: 15s
# max number of failed metric checks before rollback
threshold: 5
# user segmentation
match:
- headers:
user-agent:
regex: ".*(Firefox|curl).*"
Bump podinfo version to 5.0.2
:
yq w -i ./apps/podinfo/kustomization.yaml 'images[0].newTag' 5.0.2
Commit and push changes:
git add -A && \
git commit -m "podinfo 5.0.2" && \
git push origin main
Tell Flux to pull changes:
flux reconcile source git flux-system
Wait for Flagger to start the A/B test:
$ kubectl -n appmesh-system logs deploy/flagger -f | jq .msg
New revision detected! Scaling up podinfo.apps
Starting canary analysis for podinfo.apps
Pre-rollout check acceptance-test passed
Advance podinfo.apps canary iteration 1/10
Open the podinfo URL in Firefox and you will be routed to version 5.0.2
or use curl:
$ curl ${URL}
{
"hostname": "podinfo-6cf9c5fd49-9fzbt",
"version": "5.0.2"
}
During the canary analysis you can generate HTTP 500 errors and high latency to test if Flagger pauses and rolls back the faulted version.
Generate HTTP 500 errors every 30s with curl:
watch -n 0.5 curl ${URL}/status/500
When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary and the canary is scaled to zero.
$ kubectl -n appmesh-system logs deploy/flagger -f | jq .msg
Advance podinfo.apps canary iteration 2/10
Halt podinfo.apps advancement success rate 98.82% < 99%
Halt podinfo.apps advancement success rate 97.93% < 99%
Halt podinfo.apps advancement success rate 97.51% < 99%
Halt podinfo.apps advancement success rate 98.08% < 99%
Halt podinfo.apps advancement success rate 96.88% < 99%
Rolling back podinfo.apps failed checks threshold reached 5
Canary failed! Scaling down podinfo.apps
If you go back to Firefox, you'll see that the podinfo version has been rollback to 5.0.1
.
Note that on Chrome or Safari, users haven't been affected by the faulty version,
as they were not routed to 5.0.2
during the analysis.
Suspend the cluster reconciliation:
flux suspend kustomization cluster-addons
Delete the demo app and mesh addons:
flux delete kustomization apps -s
flux delete kustomization mesh-addons -s
Delete the AppMesh mesh:
kubectl delete mesh --all
Delete the EKS cluster:
eksctl delete cluster -f .eksctl/config.yaml