Troubleshooting Kubernetes Applications

Table of contents:

Preparation
Intro
Poking pods
Storage
Network
Security
Observability
Vaccination
References

To demonstrate different issues and failures, and how to fix them, I use the commands and resources shown below.

NOTE: whenever you see a 📄 icon, it means this is a reference to the official Kubernetes docs.

Prerequisites

Kubernetes 1.16 or higher

Preparation

Before starting, set up:

# create the namespace we'll be operating in:
kubectl create ns vnyc

# in a different tmux pane, keep an eye on the resources:
watch kubectl -n vnyc get all

Intro

Using 00_intro.yaml:

kubectl -n vnyc apply -f 00_intro.yaml

kubectl -n vnyc describe deploy/unhappy-camper

THEPOD=$(kubectl -n vnyc get po -l=app=whatever --output=jsonpath='{.items[*].metadata.name}')
kubectl -n vnyc describe po/$THEPOD
kubectl -n vnyc logs $THEPOD
kubectl -n vnyc exec -it $THEPOD -- sh

kubectl -n vnyc delete deploy/unhappy-camper
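The label-based pod lookup used above comes back in almost every later section, so it can be wrapped in a small helper. This is a minimal sketch, assuming a POSIX shell; the function name pod_for is hypothetical and not part of the original material. It returns only the first match, so commands like kubectl exec never receive several pod names at once, and kubectl reports an error if nothing matches the selector.

```shell
# pod_for NAMESPACE SELECTOR
# Prints the name of the first pod matching the label selector.
# NOTE: pod_for is a hypothetical helper name, not a kubectl feature.
pod_for() {
  kubectl -n "$1" get po -l "$2" \
    --output=jsonpath='{.items[0].metadata.name}'
}

# Usage (needs a reachable cluster):
#   THEPOD=$(pod_for vnyc app=whatever)
#   kubectl -n vnyc logs "$THEPOD"
```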

Poking pods

Pod lifecycle

Figure: the pod lifecycle.

Image issue

Using 01_pp_image.yaml:

# let's deploy a confused image and look for the error:
kubectl -n vnyc apply -f 01_pp_image.yaml
kubectl -n vnyc get events | grep confused | grep Error

# fix it by specifying the correct image:
kubectl -n vnyc patch deployment confused-imager  \
                --patch '{ "spec" : { "template" : { "spec" : { "containers" : [ { "name" : "something" , "image" : "mhausenblas/simpleservice:0.5.0" } ] } } } }'

kubectl -n vnyc delete deploy/confused-imager
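Grepping the event stream works, but the reason a container is stuck waiting can also be read straight from the pod status. The following is a sketch under assumptions: the helper names waiting_reasons and image_pull_failures are made up for illustration, and the filter only covers the image-related reasons I know the kubelet reports (ErrImagePull, ImagePullBackOff, InvalidImageName).

```shell
# waiting_reasons NAMESPACE
# Prints "<pod><TAB><waiting reason(s)>" for every pod in the namespace.
# The reason column is empty for pods whose containers are not waiting.
waiting_reasons() {
  kubectl -n "$1" get po \
    --output=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# image_pull_failures
# Filters stdin down to lines that carry an image-related waiting reason.
image_pull_failures() {
  grep -E 'ErrImagePull|ImagePullBackOff|InvalidImageName'
}

# Usage (needs a reachable cluster):
#   waiting_reasons vnyc | image_pull_failures
```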

Relevant real-world examples on StackOverflow:

Keeps crashing

Using 02_pp_oomer.yaml and 02_pp_oomer-fixed.yaml:

# prepare a greedy fellow that will OOM:
kubectl -n vnyc apply -f 02_pp_oomer.yaml

# wait > 5s and then check mem in container
# (these are cgroup v1 paths; on cgroup v2 nodes the files are /sys/fs/cgroup/memory.max and /sys/fs/cgroup/memory.current):
kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath='{.items[*].metadata.name}') -c greedymuch -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes /sys/fs/cgroup/memory/memory.usage_in_bytes


kubectl -n vnyc describe po $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath='{.items[*].metadata.name}')

# fix the issue:
kubectl -n vnyc apply -f 02_pp_oomer-fixed.yaml

# wait > 20s
kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=oomer --output=jsonpath='{.items[*].metadata.name}') -c greedymuch -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes /sys/fs/cgroup/memory/memory.usage_in_bytes

kubectl -n vnyc delete deploy wegotan-oomer
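The two numbers printed by the cat commands above are easier to judge as a percentage, and the reason for the last restart can be pulled from the pod status rather than scanned for in the describe output. A rough sketch, assuming cgroup v1 style numeric values (on cgroup v2 an unlimited container reports max instead of a number); mem_pct and last_exit_reason are hypothetical helper names.

```shell
# mem_pct LIMIT_BYTES USAGE_BYTES
# Prints memory usage as an integer percentage of the limit (rounded down).
# Assumes both arguments are plain numbers, as with the cgroup v1 files.
mem_pct() {
  echo $(( $2 * 100 / $1 ))
}

# last_exit_reason NAMESPACE POD
# Prints why each container in the pod last terminated, e.g. OOMKilled.
# Prints nothing if no container has terminated yet.
last_exit_reason() {
  kubectl -n "$1" get po "$2" \
    --output=jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Usage:
#   mem_pct 209715200 104857600
#   last_exit_reason vnyc "$THEPOD"    # needs a reachable cluster
```

A value creeping towards 100 shortly before a restart, combined with OOMKilled as the last exit reason, points at a memory limit that is too low for the workload.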

Relevant real-world examples on StackOverflow:

Something's wrong with the app

Using 03_pp_logs.yaml:

kubectl -n vnyc apply -f 03_pp_logs.yaml

# nothing to see here:
kubectl -n vnyc describe deploy/hiccup

# but I see it in the logs:
kubectl -n vnyc logs --follow $(kubectl -n vnyc get po -l=app=hiccup --output=jsonpath='{.items[*].metadata.name}')

kubectl -n vnyc delete deploy hiccup
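When the logs are long, following them live is not always practical. As a sketch, the helpers below limit the output to a recent window and keep only lines that look like problems. The names recent_logs and suspicious_lines, the 10 minute window, the 100 line cap and the search pattern are all assumptions to adapt to what the app actually logs.

```shell
# recent_logs NAMESPACE POD
# Prints at most the last 100 log lines written within the past 10 minutes.
recent_logs() {
  kubectl -n "$1" logs --since=10m --tail=100 "$2"
}

# suspicious_lines
# Keeps stdin lines mentioning an error, matched case-insensitively.
# The pattern is a guess; adjust it to the app's log format.
suspicious_lines() {
  grep -i -E 'error|exception|fail'
}

# Usage (needs a reachable cluster):
#   recent_logs vnyc "$THEPOD" | suspicious_lines
```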

Relevant real-world examples on StackOverflow:

References:

Storage

Using 04_storage-failedmount.yaml and 04_storage-failedmount-fixed.yaml:

kubectl -n vnyc apply -f 04_storage-failedmount.yaml

# has the data been written?
kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath='{.items[*].metadata.name}') -c writer -- cat /tmp/out/data

# has the data been read in?
kubectl -n vnyc exec -it $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath='{.items[*].metadata.name}') -c reader -- cat /tmp/in/data

kubectl -n vnyc describe po $(kubectl -n vnyc get po -l=app=wheresmyvolume --output=jsonpath='{.items[*].metadata.name}')

kubectl -n vnyc apply -f 04_storage-failedmount-fixed.yaml

kubectl -n vnyc delete deploy wheresmyvolume
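When one container sees the data and another does not, comparing the mount paths per container is a quick first check. A minimal sketch; mounts_by_container is a hypothetical helper name, and it only lists what the pod spec declares, not what is actually on disk.

```shell
# mounts_by_container NAMESPACE POD
# Prints one line per container: "<container>: <mountPath> <mountPath> ...".
mounts_by_container() {
  kubectl -n "$1" get po "$2" \
    --output=jsonpath='{range .spec.containers[*]}{.name}{": "}{.volumeMounts[*].mountPath}{"\n"}{end}'
}

# Usage (needs a reachable cluster):
#   mounts_by_container vnyc "$THEPOD"
```

A container that is missing the expected path in this listing never had the volume mounted, which narrows the problem down to the pod spec rather than the storage backend.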

Relevant real-world examples on StackOverflow:

References:

Debugging Kubernetes PVCs
For further references, see Stateful Kubernetes

Network

Using 05_network-wrongsel.yaml and 05_network-wrongsel-fixed.yaml:

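# NOTE: with kubectl 1.18 or later, kubectl run creates a bare pod instead of a deployment
# (clean up with kubectl delete pod webserver in that case):
kubectl -n vnyc run webserver --image nginx --port 80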

kubectl -n vnyc apply -f 05_network-wrongsel.yaml 

kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- curl webserver.vnyc

kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- ping webserver.vnyc

kubectl -n vnyc run -it --rm debugpod --restart=Never --image=centos:7 -- ping $(kubectl -n vnyc get po -l=run=webserver --output=jsonpath='{.items[*].status.podIP}')

kubectl -n vnyc apply -f 05_network-wrongsel-fixed.yaml 

kubectl -n vnyc delete deploy webserver
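A service whose selector matches no pods ends up with an empty endpoints list, and that can be checked directly instead of inferred from a failing curl. A small sketch with hypothetical helper names endpoint_ips and has_endpoints:

```shell
# endpoint_ips NAMESPACE SERVICE
# Prints the IPs of the ready endpoints behind the service, space-separated.
endpoint_ips() {
  kubectl -n "$1" get endpoints "$2" \
    --output=jsonpath='{.subsets[*].addresses[*].ip}'
}

# has_endpoints NAMESPACE SERVICE
# Succeeds (exit 0) only if the service has at least one ready endpoint.
has_endpoints() {
  [ -n "$(endpoint_ips "$1" "$2")" ]
}

# Usage (needs a reachable cluster): if there are no endpoints,
# compare the service selector with the labels the pods carry.
#   has_endpoints vnyc webserver || {
#     kubectl -n vnyc get svc webserver --output=jsonpath='{.spec.selector}'
#     kubectl -n vnyc get po --show-labels
#   }
```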

Other scenarios often found:

Seeing an error message such as connection refused? You could be hitting the 127.0.0.1 issue; the solution is to make the app listen on 0.0.0.0 rather than on localhost. See also some discussion here.
Missing firewall rules, from cluster-internal open ports to communication between clusters, can cause all kinds of issues. How exactly you go about fixing them depends very much on the environment (AWS, Azure, GCP, on-premises, etc.), and it is most certainly an infra admin task rather than an appops task.
Taking a pod offline for debugging: on the pod, remove the relevant label(s) the service uses in its selector. This removes the pod from the pool of endpoints the service sends traffic to while leaving the pod running, ready for you to kubectl exec -it into.
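Removing a label, as described in the last scenario, is done with kubectl label and a trailing dash after the label key. A minimal sketch; quarantine_pod is a hypothetical helper name and the label key run in the usage line is an assumption about what the service selects on.

```shell
# quarantine_pod NAMESPACE POD LABEL_KEY
# Removes the label LABEL_KEY from the pod; the trailing "-" tells
# kubectl label to delete the key instead of setting it.
quarantine_pod() {
  kubectl -n "$1" label pod "$2" "$3-"
}

# Usage (needs a reachable cluster):
#   quarantine_pod vnyc "$THEPOD" run
#   kubectl -n vnyc exec -it "$THEPOD" -- sh
```

If the pod belongs to a deployment that selects on the same label, the controller will start a replacement pod once the label is gone; the unlabeled pod keeps running until you delete it.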

Relevant real-world examples on StackOverflow:

References:

Debug Services 📄
Troubleshooting Kubernetes Networking Issues
For further references, see Container Networking

Security

kubectl -n vnyc create sa prober
kubectl -n vnyc run -it --rm probepod --serviceaccount=prober --restart=Never --image=centos:7 -- sh

# in the container; this will result in a 403, because we don't have the necessary permissions:
export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
APISERVERTOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -H "Authorization: Bearer $APISERVERTOKEN"  https://kubernetes.default/api/v1/namespaces/vnyc/pods

# in a different tmux pane, verify whether the SA is actually allowed to list pods:
kubectl -n vnyc auth can-i list pods --as=system:serviceaccount:vnyc:prober

# … seems not to be the case, so give sufficient permissions:
kubectl create clusterrole podreader \
        --verb=get --verb=list \
        --resource=pods

kubectl -n vnyc create rolebinding allowpodprobes \
        --clusterrole=podreader \
        --serviceaccount=vnyc:prober \
        --namespace=vnyc

# clean up
kubectl delete clusterrole podreader && kubectl delete ns vnyc
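Before running the clean-up, the effect of the role binding can be re-checked verb by verb with the same kubectl auth can-i call used earlier. A sketch under assumptions: sa_report is a hypothetical helper name, and it relies on can-i printing yes or no on stdout.

```shell
# sa_report NAMESPACE SERVICEACCOUNT RESOURCE VERB...
# Prints one line per verb: "<verb> <resource>: <yes|no>".
sa_report() {
  ns=$1 sa=$2 res=$3
  shift 3
  for verb in "$@"; do
    # can-i exits non-zero when the answer is "no", so ignore the exit code
    answer=$(kubectl -n "$ns" auth can-i "$verb" "$res" \
      --as="system:serviceaccount:$ns:$sa") || true
    printf '%s %s: %s\n' "$verb" "$res" "$answer"
  done
}

# Usage (needs a reachable cluster):
#   sa_report vnyc prober pods get list delete
```

With the podreader role bound as above, I would expect yes for get and list and no for delete.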

Relevant real-world examples on StackOverflow:

For references, see kubernetes-security.info.

Observability

Observability spans metrics (Prometheus and Grafana), logs (EFK/ELK stack), and tracing (OpenCensus and OpenTracing).

Service ops in practice

Show Linkerd 2.0 in action using this Katacoda scenario as a starting point.

Distributed tracing and debugging

Show Jaeger 1.6 in action using this Katacoda scenario.

References:

Vaccination

Show chaoskube in action, killing off random pods in the vnyc namespace.

We have the following setup:

                                                         +----------------+   
                                                         |                |   
                                                 +-----> | webserver/pod1 |   
                                                 |       |                |   
+----------------+                               |       +----------------+   
|                |                               |       +----------------+   
| appserver/pod1 +--------+         +---------+  |       |                |   
|                |        |      +--+         |  +-----> | webserver/pod2 |   
+----------------+        |     X             |  |       |                |   
                          |    X              |  |       +----------------+   
                          |   X               |  |       +----------------+   
                          v  X                |  |       |                |   
                            X   svc/webserver +--------> | webserver/pod3 |   
                          ^  X                |  |       |                |   
+----------------+        |   X               |  |       +----------------+   
|                |        |    X              |  |       +----------------+   
| appserver/pod2 +--------+     X             |  |       |                |   
|                |              +--+          |  +-----> | webserver/pod4 |   
+----------------+                 +----------+  |       |                |   
                                                 |       +----------------+   
                                                 |       +----------------+   
                                                 |       |                |   
                                                 +-----> | webserver/pod5 |   
                                                         |                |   
                                                         +----------------+

That is, a webserver running with five replicas along with a service as well as an appserver running with two replicas that queries said service.

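# let's create our victims, that is, webservers and appservers
# (kubectl run with --replicas needs kubectl older than 1.18; newer versions create a single pod,
# so use kubectl create deployment and kubectl scale there instead):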
kubectl create ns vnyc
kubectl -n vnyc run webserver --image nginx --port 80 --replicas 5
kubectl -n vnyc expose deploy/webserver
kubectl -n vnyc run appserver --image centos:7 --replicas 2 -- sh -c "while true; do curl webserver ; sleep 10 ; done"
kubectl -n vnyc logs deploy/appserver --follow

# also keep an eye on the events generated:
kubectl -n vnyc get events --watch

# now release the chaos monkey:
chaoskube \
    --interval 30s \
    --namespaces 'vnyc' \
    --no-dry-run

kubectl delete ns vnyc
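While chaoskube is running (that is, before the namespace is deleted), it helps to see how far the webserver deployment dips below its five replicas. A minimal sketch; ready_replicas and is_degraded are hypothetical helper names, and the 5 second polling interval in the usage comment is arbitrary.

```shell
# ready_replicas NAMESPACE DEPLOYMENT
# Prints the number of ready replicas; the status field is absent
# when nothing is ready, so fall back to 0.
ready_replicas() {
  ready=$(kubectl -n "$1" get deploy "$2" \
    --output=jsonpath='{.status.readyReplicas}')
  echo "${ready:-0}"
}

# is_degraded NAMESPACE DEPLOYMENT WANTED
# Succeeds (exit 0) if fewer than WANTED replicas are currently ready.
is_degraded() {
  [ "$(ready_replicas "$1" "$2")" -lt "$3" ]
}

# Usage (needs a reachable cluster), polling every 5 seconds:
#   while true; do
#     is_degraded vnyc webserver 5 && echo "webserver below 5 ready replicas"
#     sleep 5
#   done
```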

Figure: screenshot of chaoskube in action, with all the above commands applied.

References:

References

General

Troubleshoot Applications 📄
Troubleshoot Clusters 📄
A site dedicated to Kubernetes Troubleshooting
Debug a Go Application in Kubernetes from IDE
CrashLoopBackoff, Pending, FailedMount and Friends: Debugging Common Kubernetes Cluster (KubeCon NA 2017): video and slide deck
10 Most Common Reasons Kubernetes Deployments Fail: Part 1 and Part 2

Language or platform specific

Debugging Microservices: How Google SREs Resolve Outages
Debugging Microservices: Lessons from Google, Facebook, Lyft
Troubleshooting Java applications on OpenShift
Google Kubernetes Engine Troubleshooting docs
