The goal of this project is to demonstrate each of the technologies that have been adopted by the [Cloud Native Computing Foundation](http://cncf.io) (CNCF) in a publicly available repository, in order to facilitate their understanding through simple deployment tooling and sample applications that serve as common ground for conversation. This project will enable replicable deployments and facilitate quantification of performance, latency, throughput, and cost across deployment models.
- [Project Overview](#projectoverview)
- [Sample Applications](#sampleapps)
- [Deployment Models](#deploymentmodels)
- [Getting Started](#quickstart)
  - [Dependencies](#dependencies)
  - [Create Cluster](#cluster)
  - [Run Demo](#demo)
- [Architecture](#arch)
  - [Kubernetes Architecture](#kubearch)
- Kubernetes - [Project](http://kubernetes.io), [Source](https://github.com/kubernetes/kubernetes)
- Prometheus - [Project](https://prometheus.io), [Source](https://github.com/prometheus)
- Count.ly - [Project](https://count.ly), [Source](https://github.com/countly/countly-server)
- Goals:
- demonstrate autoscaling of Countly
- illustrate background, lower priority jobs vs foreground, higher priority jobs
- [Details of sample application](#countly)
- Boinc - [Project] (http://boinc.berkeley.edu), [Source] (https://github.com/BOINC)
- Goals:
- demonstrate grid computing use case
- contribute cycles to curing [Zika](https://en.wikipedia.org/wiki/Zika_virus) in IBM [World Community Grid](https://www.worldcommunitygrid.org)
- Details of sample application
A variety of deployment models will be supported. Support for each deployment model will be delivered in the following order of priority:
- Local (on your machine)
- [CNCF Community Cluster](#cncf-cluster)
- AWS
- Azure
- GCP
- Packet
Because the same sample applications run across this breadth of deployment models, performance, cost, and other characteristics can be compared between the various clusters.
Intel has donated a 1,000-node cluster of servers, hosted at Switch in Las Vegas, for use by the CNCF community. See the [project page](https://cncf.io/cluster) for a description of the cluster, or get involved via the [cluster community](https://github.com/cncf/cluster).
The project output will be an open source GitHub repo that will become widely referenced within the CNCF community. All work will occur in a public repo, all externally referenced projects will be open source, and this project itself is licensed under Apache 2.0.
Note that these are explicitly marketing demos, not reference stacks. The CNCF's [Technical Oversight Committee](https://github.com/cncf/toc) will over time be adopting additional projects and may eventually publish reference stacks. By contrast, this project is designed to take the shortest possible path to successful multi-cloud deployments of diverse applications.
Getting started with the cncfdemo is a three-step process:
- [Install cncfdemo](#dependencies)
- [Create a Kubernetes cluster running Prometheus](#cluster)
- [Run demo apps & benchmarks](#demo)
Run:

```
pip install cncfdemo
```

pip is the Python package manager. It is strongly recommended to also use a dedicated Python virtualenv. For detailed install instructions for your platform, read The Hitchhiker's Guide to Python.
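A minimal sketch of an isolated install, assuming the virtualenv tool is already available on your system:

```bash
# Create and activate a throwaway virtualenv, then install cncfdemo into it
virtualenv cncfdemo-env
source cncfdemo-env/bin/activate
pip install cncfdemo
```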
Next, bootstrap a cluster:

```
cncfdemo bootstrap aws
```

AWS is used as an example; substitute your provider of choice. Note: grab a beverage, this step takes several minutes.
- Run `cncfdemo start`
- Browse to countly.cncfdemo.io
- Run `cncfdemo benchmark --time 5m --save-html`
```
❯ cncfdemo create configmap example --from-file=path/to/directory
❯ kubectl create configmap example --from-file=path/to/directory
```
cncfdemo is written in Python and, like kubectl, interacts with the remote REST API server (though unlike kubectl, it supports HTTP only). It further differs from kubectl in that it can create new clusters on your favorite cloud provider (or even bare metal).
In addition to quickly spinning up new clusters from scratch, the cncfdemo command comes with a built-in demo of a complex, multi-step, multi-component deployment.
When you run:

```
❯ cncfdemo start
```

the following goes on behind the scenes:
- Prometheus and its Pushgateway are deployed
- Grafana is deployed with preconfigured dashboards to expose metrics collected by Prometheus
- ConfigMaps are created from autodetection of configuration files required by the applications being deployed
- A sharded mongo cluster is provisioned
- One-shot Kubernetes Jobs initialize and configure the mongo cluster
- A mongos service is exposed internally to the cluster
- Multiple instances of Countly are spun up against the mongo cluster
- The Countly service is exposed at a human-readable subdomain, countly.cncfdemo.io, via Route53
- HTTP benchmarking is performed against the Countly subdomain via WRK jobs (sketched after this list)
- Idle cluster capacity to search for a cure to the Zika virus is donated via Boinc and IBM WorldCommunityGrid
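For a sense of what those WRK benchmark jobs do, here is a hedged sketch of an equivalent standalone invocation of the wrk load-generation tool (the flag values are illustrative):

```bash
# Drive HTTP load at the demo's Countly endpoint for 5 minutes,
# using 4 threads and 100 open connections
wrk --threads 4 --connections 100 --duration 5m http://countly.cncfdemo.io
```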
The demo described above is difficult and brittle to put together with regular kubectl usage. Editing YAML files by hand is time consuming and error prone.
The demo was accomplished with Jinja templating, several advanced Kubernetes primitives and patterns that are currently in Alpha, and some extensions to the cncfdemo wrapper - all in order to greatly simplify and reduce the number of commands required to accomplish a complex deployment.
- Additional cloud provider support
- A visualization/UI layer to display the progress of cluster bootstraps, deployments, and benchmarks
This document will walk you through setting up Kubernetes. It is for people looking for a fully automated command to bring up a Kubernetes cluster. (In fact, this is the basis for the cncfdemo command utility; you can use that directly or learn how to make your own.)
Kubernetes is currently experiencing a Cambrian explosion of deployment methodologies. Described below is an opinionated approach with three guiding principles:
- Minimal dependencies
- Portability between cloud providers, bare metal, and local, with minimal alteration
- Image-based deployments: bake all components into a single image
If you just want to try it out, skip to the Quick start.
Kubernetes components are neatly split up into three distinct groups*.
*Diagram of a highly available Kubernetes cluster*
etcd is a reliable distributed key-value store; it is where the cluster state is kept - the source of truth. All other parts of Kubernetes are stateless. You could (and should) deploy and manage an etcd cluster completely independently, as long as Kubernetes masters can connect to and use it.
A highly available etcd setup is well covered by many other guides. The cncfdemo and this document eschew such a setup for simplicity's sake. Instead we opt for a single Kubernetes master with etcd installed and available on 127.0.0.1.
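A quick way to sanity-check that co-located etcd, assuming an etcd2-era etcdctl on the master (it defaults to http://127.0.0.1:2379):

```bash
# Run on the master; reports the health of the single-member cluster
etcdctl cluster-health
```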
Let's zoom in further on one of those circles representing a Kubernetes minion.
*AWS AutoScalingGroups, GCE "Managed Instance Groups", Azure "Scale Sets"
Contents of /etc/sysconfig/kubernetes-minions:

```
CLUSTER_NAME=cncfdemo
CLOUD_PROVIDER=--cloud-provider=aws
KUBELET_HOSTNAME=--hostname-override=ip-172-20-0-12.us-west-2.compute.internal
```
In the future, Kubernetes will autodetect the provider and switch to EC2 instance IDs instead of resolvable hostnames, so the latter two lines won't be needed.
This file is injected into the instance upon creation via user data.
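As a sketch of how that injection might look on AWS (the hostname derivation here is illustrative, not the demo's exact user data):

```bash
#!/usr/bin/env bash
# EC2 user-data script: write the minion marker/config file at first boot
cat > /etc/sysconfig/kubernetes-minions <<EOF
CLUSTER_NAME=cncfdemo
CLOUD_PROVIDER=--cloud-provider=aws
KUBELET_HOSTNAME=--hostname-override=$(hostname -f)
EOF
```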
Kubernetes consists of half a dozen binaries with no dependencies; we include all of them and create and enable the corresponding services. Accordingly, the systemd service files include a conditional along the lines of:
```
ConditionPathExists=/etc/sysconfig/kubernetes-minions
ConditionPathExists=!/etc/sysconfig/kubernetes-masters
```
Naturally for a master we create the file with the name 'kubernetes-masters' instead and flip the conditional in the service files.
| | Master | Minion |
|---|---|---|
| kube-apiserver | ✔ | |
| kube-controller-manager | ✔ | |
| kube-scheduler | ✔ | |
| kube-proxy | | ✔ |
| kubelet | | ✔ |
| kubectl (no service file) | | |
Later on you will see DNS discovery of services used extensively within Kubernetes clusters as a core feature. But let's continue with the bootstrap; as per the table, a minion runs only a kubelet and a kube-proxy.
Contents of kubelet.service:

```
[Service]
EnvironmentFile=/etc/kubernetes/config
EnvironmentFile=/etc/kubernetes/kubelet
EnvironmentFile=/etc/sysconfig/kubernetes-minions
ExecStartPre=/bin/bash -c "until /usr/bin/curl -s http://masters.${CLUSTER_NAME}.k8s:8080; do echo \"waiting for master...\"; sleep 5; done"
ExecStart=/bin/kubelet \
    --api-servers=http://masters.${CLUSTER_NAME}.k8s:8080 \
    $CLOUD_PROVIDER \
    $KUBELET_ARGS
```
The precondition ensures the master is reachable and responsive. The endpoint is by convention 'masters.cncfdemo.k8s'.
There are many other ways to do this; however, this approach is not provider-specific.
For AWS the principles are outlined in Building a Dynamic DNS for Route 53 using CloudWatch Events and Lambda.
The process is reduced to the following:
- Configure CloudWatch to trigger a Lambda function on any and all AutoScalingGroup events
- The Lambda function simply sets a DNS record set in the private 'k8s' domain to reflect the list of healthy instances in that group
As a result, an AutoScalingGroup with the tags KubernetesCluster and Role will always have its membership correctly reflected via '{Role}.{KubernetesCluster}.k8s' DNS lookups.
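To make the mechanism concrete, here is a hedged sketch of the record update the Lambda function performs, expressed with the AWS CLI rather than Lambda code; the ASG name and hosted-zone ID are placeholders:

```bash
# Collect the private IPs of running instances in the (hypothetical) minions ASG
IPS=$(aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=cncfdemo-minions" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PrivateIpAddress" --output text)

# Build the ResourceRecords JSON fragment from the IP list
RECORDS=$(for ip in $IPS; do printf '{"Value": "%s"},' "$ip"; done | sed 's/,$//')

# UPSERT the A record so minions.cncfdemo.k8s reflects current group membership
aws route53 change-resource-record-sets \
  --hosted-zone-id "$PRIVATE_ZONE_ID" \
  --change-batch "{\"Changes\": [{\"Action\": \"UPSERT\",
    \"ResourceRecordSet\": {\"Name\": \"minions.cncfdemo.k8s\", \"Type\": \"A\",
      \"TTL\": 30, \"ResourceRecords\": [$RECORDS]}}]}"
```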
At this point you might be wondering why minions.{KubernetesCluster}.k8s is needed. The masters subdomain is useful because the minions need to point at the masters. But who needs to point at minions? The answer is: the other minions.
Kubernetes has a distinctive networking model.
> Kubernetes allocates an IP address to each pod. When creating a cluster, you need to allocate a block of IPs for Kubernetes to use as Pod IPs. -- Kubernetes Docs
In order to let Pod A (10.32.0.1) from one minion node (172.20.0.1) communicate with Pod B (10.32.0.3) on another minion node (172.20.0.2) we use an Overlay network. It is possible to achieve this sort of routing without an overlay network (and associated performance penalty) but an overlay is simpler to configure and more importantly it is portable.
CNI, the Container Network Interface, is a proposed standard for configuring network interfaces for Linux application containers. CNI is supported by Kubernetes, Apache Mesos and others.
Required directories for CNI plugin:
- /opt/cni/bin
- /etc/cni/net.d
The default CNI plugin binaries need to be placed in /opt/cni/bin/. We have opted to use Weave; its setup script adds the weave binaries into this directory as well.
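A hedged sketch of that preparation step (the download URL is Weave's documented installer shortcut; versions and exact paths may differ):

```bash
# Create the directories the CNI plugin expects
mkdir -p /opt/cni/bin /etc/cni/net.d

# Install the weave script, then let its setup step pull images and
# drop its binaries (including the CNI plugin) into place
curl -L git.io/weave -o /usr/local/bin/weave
chmod +x /usr/local/bin/weave
/usr/local/bin/weave setup
```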
Finally, we direct the kubelet to use the above:

```
KUBELET_ARGS="--network-plugin=cni --network-plugin-dir=/etc/cni/net.d --docker-endpoint=unix:///var/run/weave/weave.sock"
```
Kubernetes will now rely on the Weave service to allocate the IPs in the overlay network.
```
PEERS=$(getent hosts minions.cncfdemo.k8s | awk '{ printf "%s ", $1 }')
MEMBERS=$(getent hosts minions.cncfdemo.k8s | wc -l)
/usr/local/bin/weave launch-router --ipalloc-init consensus=$MEMBERS ${PEERS}
```
You can read further details on Weave initialization strategies; we are using the consensus strategy. In keeping with our example:
- PEERS=172.20.0.63 172.20.0.64
- MEMBERS=2
> Weave Net uses the estimate of the number of peers at initialization to compute a majority or quorum number – specifically floor(n/2) + 1. If the actual number of peers is less than half the number stated, then they keep waiting for someone else to join in order to reach a quorum.

In our example, MEMBERS=2 gives a quorum of floor(2/2) + 1 = 2, so both peers must be present before IP allocation can begin.
Once the quorum has been reached, you can see how the IP allocation has been divvied up between the members:
```
weave status ipam
82:85:7e:7f:71:f3(ip-172-20-0-63) 32768 IPs (50.0% of total)
ce:38:5e:9d:35:ab(ip-172-20-0-64) 32768 IPs (50.0% of total)
```
For a deeper dive on how this mechanism works: Distributed systems with (almost) no consensus.
Countly is an open source web & mobile analytics and marketing platform. It provides insights about user actions.
Two configuration files dictate the behavior of this demo application: [api.js](configMaps/countly/api.js) and [frontend.js](configMaps/countly/frontend.js). Each contains only one change from the default configuration:

```
host: "mongos.default"
```

By setting host to "mongos.default", the Countly application looks for its MongoDB servers at that address. "mongos.default" resolves to a Kubernetes service called mongos in the default namespace; Kubernetes cluster DNS exposes every service under a name of the form service.namespace.
- Patterns and Best Practices
  - How to adapt your app to run in Kubernetes (Countly example in detail)
  - Clustered datastores on top of Kubernetes (Mongo example in detail)
  - Making use of spare capacity with background jobs (Boinc example in detail)
- Complex, Scriptable Kubernetes Deployments & Jinja Templating
  - What actually happens when you 'cncfdemo start'
Inevitably when working with containers the question of the base image comes up. Is it better to opt for the spartan minimalism of Alpines and Busyboxes or a bog standard 700MB CentOS?
Should you split your app into multiple containers each running a single process or bake everything into one blob?
The (typical engineer) answer is "it depends".
Take a complex app like Countly for example. To package it up conveniently, so a developer can quickly try it out on her laptop for instance, it is necessary to bundle Mongo, Nginx, NodeJS, the Countly API server app, and the dashboard UI app.
You can't always run one process per container. What you really might crave in such a situation is a process control system or even a proper init.
> Traditionally a Docker container runs a single process when it is launched, for example an Apache daemon or a SSH server daemon. Often though you want to run more than one process in a container. There are a number of ways you can achieve this ranging from using a simple Bash script as the value of your container's CMD instruction to installing a process management tool. - Docker's documentation on using supervisord
There are several such supervisors, a popular one being runit. Runit is written in C and uses fewer resources than supervisord, adheres to the Unix philosophy of utilities doing one thing well, and is very reliable.
There's a subtle problem of Docker and PID 1 zombie reaping that the aforementioned process supervisors alone don't solve.
The Ubuntu-based phusion baseimage works around this with a small (340 line) my_init script.

> Ideally, the PID 1 problem is solved natively by Docker. It would be great if Docker supplies some builtin init system that properly reaps adopted child processes. But as of January 2015, we are not aware of any effort by the Docker team to address this.
As of September 20, 2016, this is finally fixed by Docker upstream with an optional small init daemon that gets injected with --init=true.
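A minimal illustration, assuming a Docker release that ships the flag (it landed after the 2016 fix mentioned above):

```bash
# Run Countly's image with Docker's built-in init as PID 1,
# so orphaned child processes get reaped properly
docker run --init countly/countly-server:16.06
```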
Countly provides an official Docker image based on phusion baseimage, the advantages and considerations of which are outlined above.
We extend it and simply keep the services we want to use:

```
FROM countly/countly-server:16.06

# Add custom Countly configs - these in turn come from a k8s volume
ADD ./runit/countly-api.sh /etc/service/countly-api/run
ADD ./runit/countly-dashboard.sh /etc/service/countly-dashboard/run
```
```
#!/usr/bin/env bash
cp /etc/config/api.js /opt/countly/api/config.js
chown countly:countly /opt/countly/api/config.js
exec /sbin/setuser countly /usr/bin/nodejs /opt/countly/api/api.js
```
countly-api.sh is almost exactly like the file it replaces. This service file is executed by runit and clobbers the default config each time. The file /etc/config/api.js is not permanently baked into the image but rather arrives via a Kubernetes ConfigMap.
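Creating that ConfigMap is a one-liner; a hedged sketch using the config paths mentioned earlier (the ConfigMap name is illustrative):

```bash
# Bundle the two Countly config files into a ConfigMap that pods can mount
kubectl create configmap countly-config \
  --from-file=configMaps/countly/api.js \
  --from-file=configMaps/countly/frontend.js
```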
And here we've had to resort to a bit of a hack. ConfigMap-backed volumes being mounted as root is a known and open issue; at the moment there's no way to specify permissions. Hence the chown line.
We've completely gotten rid of the Nginx service Countly bundles, as edge routing can be done any number of ways elsewhere in Kubernetes.
Whether or not we split apart the dashboard app and the API server is not a question of convenience or style. The API server clearly maps to a replication controller and can be horizontally auto scaled with custom metrics (more on this later).
The dashboard app, for our purposes, has no high availability requirement and is rarely used. However, even when idle, it takes up resources on the pod, and this waste is multiplied across however many API servers we end up with - whereas we only need one dashboard app running at a time. The clean way is to split it out into one separate pod on the side.
As for mongo, both of these services contain a connection string we pass via a ConfigMap, like so:
```
mongodb: {
    host: "mongos.default",
    db: "countly",
    port: 27017,
    max_pool_size: 500,
}
```
As a result, it is up to us to deploy and scale mongo separately from Countly. Even if this particular mongo cluster is dedicated entirely to Countly - and it should be - this separation of concerns is good for maintainability and resilience.
This decoupling is healthy. For example, a bug in one of the horizontally scaled Countly API servers that causes a crash will not take a mongo pod along with it, so the impact on overall performance is contained. Instead it will crash and burn on the side, the liveness probes will fail, and Kubernetes in turn will transparently route further requests away to siblings while simultaneously launching a replacement.
- graph showing how chaos-monkey style killing one of the countlies impacts overall writes, and for how long (fast recovery is cool)
Ask yourself, does it make sense for Countly pods to be running when there is no Mongo backend available to them? The answer is no. In fact, if Countly happens to start first the deployment becomes unpredictable.
Luckily, init containers have reached beta status in the latest version of Kubernetes (1.4). In short, with this feature the containers you normally specify now start last, after an ordered, blocking list of init containers has completed.
```
pod:
  spec:
    initContainers:
    - name: init-container1
      image: ...
    - name: init-container2
      image: ...
    containers:
    - name: regularcontainer
      image: ...
```
init-container1 can run a simple check along the lines of `test $(nslookup mongos.default | grep -c Address) -ge 3`. This fails for as long as the mongos service is not up and passing its readiness and liveness probes, so Countly is blocked from starting until the init container succeeds - exactly the desired behaviour.
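As a concrete (hedged) sketch of that check, assuming BusyBox's nslookup output format, where each resolved record prints an Address line:

```bash
#!/bin/sh
# Exit non-zero (so the init container is retried) until
# mongos.default resolves to at least 3 address lines
test "$(nslookup mongos.default | grep -c Address)" -ge 3
```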
Separate article?