💰 KubeSurvival

KubeSurvival allows you to significantly reduce your Kubernetes compute costs by finding the cheapest machine types that can run your workloads successfully.

If you have a multi-tenant environment, ML training jobs, a large number of ML model servers, etc, this tool can help you optimize your K8s compute costs.

To easily define workloads, KubeSurvival uses a very simple DSL:

  (
    # Some microservice
    pod(cpu: 1, memory: "1Gi") + 

    # Another microservice - with 3 replicas
    pod(cpu: "500m", memory: "2Gi") * 3 +

    # More microservices!
    (
      pod(cpu: 1, memory: "1Gi") +
      pod(cpu: "250m", memory: "1Gi")
    ) * 3
  ) * 2  # Production, Staging

This will give you a result such as:

Instance type: t3.medium
Node count: 11
Total Price per Month: USD $340.45

Made with ❤️ by Aporia

Installation

Download a precompiled binary for your operating system from the Releases page.

Alternatively, if you have Go installed, you can run:

$ go install github.com/aporia-ai/kubesurvival/v2

Usage

To run KubeSurvival:

./kubesurvival config.yaml

See the examples directory for example config files.

How does it work?

KubeSurvival uses k8s-cluster-simulator to simulate Kubernetes pod scheduling, without running on the actual underlying machines. It iterates over all possible instance types and node counts, simulates a K8s cluster with your workload, and checks if there are any pending pods.

For each simulation it calculates the on-demand cost per month using the ec2-instances-info library. Additionally, it queries the eni-max-pods.txt file to determine what's the maximum number of pods in each instance type.

When simulating a cluster, KubeSurvival always makes sure you have 10% free CPU and Memory on each node.

Finally, KubeSurvival selects the cheapest configuration without pending pods.

What's missing from this?

Well... a lot actually. Here's a partial list:

Support for AKS and GKE
Support for calculating costs of EBS storages
Support for different node groups (e.g 2 machines with GPU + 4 machines without GPU)
and probably much more!

We would love your help! ❤️