This is a collection of questions and answers about running and operating Kubernetes clusters (primarily on AWS). New questions and answers are welcome.
Kubernetes is designed to be resilient to any individual node failure, master or worker. When a master fails, the nodes of the cluster keep operating, but no changes (pod creation, service membership updates, etc.) can be made until a master is available again. When a worker fails, the master stops receiving status updates from it and marks the node NotReady. If a node stays NotReady for 5 minutes (the default), the master reschedules all pods that were running on the dead node onto other available nodes.
There is a DNS server called skydns which runs in a pod in the cluster, in the kube-system namespace. That DNS server reads from etcd and serves DNS entries for Kubernetes services to all pods. You can reach any service with the name <service>.<namespace>.svc.cluster.local. The resolver's search path automatically includes <namespace>.svc.cluster.local, so a pod should be able to reach any service in its own namespace with just <service>.
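As a sketch of how these names are composed (the service name my-service and namespace default are hypothetical examples):

```shell
# Sketch: how the cluster DNS name for a service is composed.
# "my-service" and "default" are hypothetical example names.
svc=my-service
ns=default
fqdn="${svc}.${ns}.svc.cluster.local"
echo "$fqdn"
```

From a pod in the default namespace, plain my-service resolves to the same address, because the resolver search path appends default.svc.cluster.local.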
The only stateful part of a Kubernetes cluster is etcd. The master servers run the controller manager, scheduler, and API server, and can be run as replicas. The controller manager and scheduler use a leader election system, so only one controller manager and one scheduler are active in the cluster at any time. An HA cluster therefore generally consists of an etcd cluster of 3+ nodes plus multiple master nodes. http://kubernetes.io/docs/admin/high-availability/#master-elected-components
Probably not; they are older and have fewer features than the newer Deployment objects.
Use kubectl get deployment <deployment>. If the DESIRED, CURRENT, and UP-TO-DATE columns are all equal, then the deployment has completed.
A DaemonSet ensures that a single copy of a pod runs on each host (or on a selected set of hosts). It's used for host-level features, for instance networking, host monitoring, or storage plugins.
In a regular deployment all the instances of a pod are exactly the same; they are indistinguishable and are thus sometimes referred to as "cattle". These are typically stateless applications that can be easily scaled up and down. In a PetSet, each pod is unique and has an identity that needs to be maintained. This is commonly used for more stateful applications like databases.
Within the cluster, most Kubernetes services are implemented as a virtual IP called a ClusterIP. A ClusterIP has a list of pods which are in the service; when a client sends traffic to the ClusterIP, iptables rules rewrite the destination to a randomly selected pod in the service, so the ClusterIP isn't actually directly routable, even from Kubernetes nodes. The iptables configuration is managed by the kube-proxy on each node, so only nodes running kube-proxy can talk to ClusterIP members. An alternative to the ClusterIP is a "headless" service, created by specifying ClusterIP=None: this does not use a virtual IP, but instead creates a DNS A record for the service that includes the IP addresses of all its pods. The live members of any service are stored in an API object called an endpoint. You can see the members of a service with kubectl get endpoints <service>.
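A headless service manifest might look like the following sketch (the service name, selector, and port are hypothetical examples):

```yaml
# Hypothetical headless service: no ClusterIP is allocated;
# the cluster DNS publishes an A record per member pod instead.
apiVersion: v1
kind: Service
metadata:
  name: my-db
spec:
  clusterIP: None        # this is what makes the service headless
  selector:
    app: my-db
  ports:
  - port: 5432
```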
There are two ways: the NodePort or LoadBalancer service type.
- NodePort. This makes every node in the cluster listen on the specified NodePort, then any node will forward traffic from that NodePort to a random pod in the service.
- LoadBalancer. This provisions a NodePort as above, but then does an additional step to automatically provision a load balancer in your cloud (AWS or GKE). In AWS it also modifies the Auto-Scaling Group of the cluster so all nodes of that ASG are added to the ELB.
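A minimal sketch of the second option (the name, selector, and ports are hypothetical examples; changing type to NodePort gives the first option):

```yaml
# Hypothetical externally reachable service. With type: LoadBalancer,
# kubernetes opens a NodePort on every node and provisions a cloud
# load balancer pointing at it.
apiVersion: v1
kind: Service
metadata:
  name: my-web
spec:
  type: LoadBalancer
  selector:
    app: my-web
  ports:
  - port: 80          # port the load balancer listens on
    targetPort: 8080  # port the pods listen on
```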
Kubernetes has node affinity which is described here: http://kubernetes.io/docs/user-guide/node-selection/
Kubernetes by default attempts node anti-affinity for pods of the same service, but it is not a hard requirement; it is best effort, and the scheduler will place multiple pods on the same node if that is the only way. http://stackoverflow.com/questions/28918056/does-the-kubernetes-scheduler-support-anti-affinity
You can call the Kubernetes API with the appropriate credentials, find the correct API object, and extract the hostIP from the output. This convoluted shell command derives the namespace from /etc/resolv.conf and gets the pod name from hostname:
curl -sSk -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api/v1/namespaces/$(grep svc.cluster.local /etc/resolv.conf | sed -e 's/search //' -e 's/.svc.cluster.local.*//')/pods/$(hostname) \
| grep hostIP | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
The above solution falls into the "works for me and will probably work for some other people, but could easily fail for a variety of reasons" category.
In AWS specifically, I find it easier to use the AWS metadata API and just curl 169.254.169.254/1.0/meta-data/local-ipv4
See the above answer on "getting the host IP address from inside a pod" for an example of using the API inside a pod
Yes, there's an example here of both an NFS client and server running within pods in the cluster: https://github.com/jsafrane/kubernetes-nfs-example
Yes, but one major downside is that ClusterIPs are implemented as iptables rules on cluster clients, so external servers would not see ClusterIPs or service changes. Because the iptables rules are managed by kube-proxy, you could address this by running kube-proxy on the external machine, which is similar to just joining the cluster. Alternatively, you could make all your services headless (ClusterIP=None); this gives external servers the ability to talk directly to services, provided they can use the Kubernetes DNS. Headless services don't use ClusterIPs, but instead create a DNS A record listing all pods in the service. Note that kube-dns itself runs inside the cluster behind a ClusterIP, so there's a chicken-and-egg problem with DNS you would have to deal with.
Look at the Downward API: http://kubernetes.io/docs/user-guide/downward-api/ It allows your pods to see labels, annotations, and a few other variables, using either a mount point or environment variables inside the pod. If those don't contain the information you need, you'll likely need to resort to either using the Kubernetes API from within the pod or something entirely separate.
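A sketch of both mechanisms in one pod (the pod name, labels, and paths are hypothetical examples):

```yaml
# Hypothetical pod reading its own metadata via the Downward API:
# environment variables for name/namespace, a volume mount for labels.
apiVersion: v1
kind: Pod
metadata:
  name: downward-demo
  labels:
    app: demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "echo $MY_POD_NAME in $MY_POD_NAMESPACE; cat /etc/podinfo/labels; sleep 3600"]
    env:
    - name: MY_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: MY_POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    volumeMounts:
    - name: podinfo
      mountPath: /etc/podinfo
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: labels
        fieldRef:
          fieldPath: metadata.labels
```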
There is no built-in functionality for this. Helm https://github.com/kubernetes/helm is a popular third-party choice. Some people use scripts or templating tools like jinja.
Think of a request as a node scheduling unit and a limit as a hard cap. Kubernetes will attempt to schedule your pods onto nodes based on the sum of the existing requests on the node plus the new pod's request. Setting a limit then enforces a cap and makes sure that a pod does not exceed it on a node. For simplicity you can set request and limit to the same value, but then you won't be able to pack things tightly onto a node for efficiency. The converse problem is that if limits are much larger than requests, there is a danger that pods use resources all the way up to their limits and overrun the node or other pods on it. CPU may not be hard-capped depending on the version of Kubernetes/Docker, but memory is: pods exceeding their memory limit are terminated and rescheduled.
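For example, a container spec fragment setting both (the specific values are hypothetical):

```yaml
# Hypothetical container resources block: the scheduler packs pods
# by requests; the kubelet enforces limits (memory by termination,
# CPU by capping/throttling where the runtime supports it).
resources:
  requests:
    cpu: 250m        # used only for scheduling decisions
    memory: 256Mi
  limits:
    cpu: 500m        # may not be hard-capped on older versions
    memory: 512Mi    # exceeding this gets the pod killed
```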
Heapster https://github.com/kubernetes/heapster is included, and its metrics are how Kubernetes measures CPU and memory for horizontal pod autoscaling (HPA). Heapster can be queried directly with its REST API. Prometheus is a more fully featured and popular alternative.
kube-up.sh is being deprecated; it will work for relatively straightforward installs, but won't be actively developed anymore. kops https://github.com/kubernetes/kops is the new recommended deployment tool, especially on AWS. It doesn't support other cloud environments yet, but that is planned.
In addition to regular EC2 IP addresses, Kubernetes creates its own cluster-internal network. In AWS, each instance in a Kubernetes cluster hosts a small subnet (by default a /24), and each pod on that host gets its own IP address within that node's /24 address space. Whenever a cluster node is added or deleted, the Kubernetes controller updates the VPC route table so that all nodes can route pod IPs directly to that node's subnet.
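As a rough sketch of the default carving, assuming a hypothetical cluster CIDR of 10.244.0.0/16 with one /24 per node (real clusters assign these automatically):

```shell
# Sketch of the default /24-per-node carving. The cluster CIDR
# 10.244.0.0/16 and the node index are hypothetical example values.
cluster_prefix="10.244"   # first two octets of the example /16 cluster CIDR
node_index=3              # the Nth node added to the cluster
pod_subnet="${cluster_prefix}.${node_index}.0/24"
echo "$pod_subnet"
```

Each node then gets one VPC route table entry pointing its /24 at that node's ENI.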
If you used kube-up.sh or kops to provision your cluster, then it created an AutoScaling Group automatically. You can re-scale that with kops, or update the ASG directly, to grow or shrink the cluster. New instances are provisioned for you and should join the cluster automatically (my experience has been that it takes 5-7 minutes for nodes to join).
Add the following metadata annotation to your LoadBalancer service:
service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
Then delete and re-create the service object, and it should come back as an internal (private) ELB.
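In context, the annotation sits in the service metadata like this sketch (the name and selector are hypothetical examples):

```yaml
# Hypothetical service provisioned as an internal ELB via the
# annotation described above.
apiVersion: v1
kind: Service
metadata:
  name: my-internal-web
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0
spec:
  type: LoadBalancer
  selector:
    app: my-web
  ports:
  - port: 80
```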
Add the following metadata annotation to your LoadBalancer service, with a comma-separated list of CIDRs:
service.beta.kubernetes.io/load-balancer-source-ranges
Each ELB gets its own security group, and this annotation adds those CIDR ranges to the allowed source IPs.
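For example (the service name and the two CIDRs are hypothetical example values):

```yaml
# Hypothetical service whose ELB only accepts traffic from two
# example CIDR ranges.
apiVersion: v1
kind: Service
metadata:
  name: my-web
  annotations:
    service.beta.kubernetes.io/load-balancer-source-ranges: "203.0.113.0/24,198.51.100.0/24"
spec:
  type: LoadBalancer
  selector:
    app: my-web
  ports:
  - port: 443
```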
There is a route table entry for every instance in the cluster, which allows other nodes to route to the pods on that node. AWS has a soft limit of 50 routes per route table and a hard limit of 100, so with the default VPC networking you will run into one of these limits as the cluster grows. Using one of the overlay networks (flannel, weave, romana) can get around this limit, and native routing support in kops is a planned feature to do this directly from AWS nodes. Another planned feature is for all these networking plugins to be drop-in additions to an existing cluster.
Yes, and "not out of the box". Provision with kops and specify the AZs you want, and your cluster will be a multi-AZ cluster within a single region. The AutoScalingGroups can add nodes to any AZ you specify. The planned solution for a multi-region cluster is to build separate clusters in each region and use federation to manage them: http://kubernetes.io/docs/admin/federation/. It is also possible to build a multi-region cluster using an overlay network like flannel, calico, or weave.
Third party project: https://github.com/wearemolecule/route53-kubernetes Future work in core kops: https://github.com/kubernetes/kops/tree/master/dns-controller
Yes. When you declare a PersistentVolumeClaim, add the annotation:
volume.alpha.kubernetes.io/storage-class: "foo"
The claim will then automatically provision a volume for you and delete it when the PersistentVolumeClaim is deleted. The value "foo" is meaningless; the annotation just needs to be set.
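A full claim using the annotation might look like this sketch (the name and size are hypothetical examples):

```yaml
# Hypothetical claim that triggers dynamic volume provisioning via
# the alpha storage-class annotation described above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
  annotations:
    volume.alpha.kubernetes.io/storage-class: "foo"
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```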
When using an EBS PersistentVolume and PersistentVolumeClaim, how does kubernetes know which AZ to create a pod in?
It just works. EBS volumes are specific to an Availability Zone, and Kubernetes knows which AZ a volume is in. When a new pod needs that volume, the pod is automatically scheduled in the Availability Zone of the volume.
Not currently.
Is there a way to give a pod a separate IAM role that has different permissions than the default instance IAM policy?
kube2iam is a third-party tool that appears to do this: https://github.com/jtblin/kube2iam
A Kubernetes node has some useful pieces of information attached as labels. Here are some examples:
beta.kubernetes.io/instance-type: c3.xlarge
failure-domain.beta.kubernetes.io/region: us-east-1
failure-domain.beta.kubernetes.io/zone: us-east-1b
kubernetes.io/hostname: ip-172-31-0-10.us-east-1.compute.internal
You can use these for AZ awareness or attach your own labels and use the Downward API for additional flexibility.
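For example, a pod can be pinned to a zone with a nodeSelector on one of the labels above (the pod name and zone value are hypothetical examples):

```yaml
# Hypothetical pod scheduled only onto nodes in the example zone.
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned
spec:
  nodeSelector:
    failure-domain.beta.kubernetes.io/zone: us-east-1b
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
```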
With kops this should be possible, but having more than one Kubernetes cluster in a VPC is not supported.