/kubesurvival

💰 Significantly reduce Kubernetes costs by finding the cheapest machine types that can run your workloads

Primary LanguageGoApache License 2.0Apache-2.0

💰 KubeSurvival

GitHub release (latest SemVer) GitHub Workflow Status Maintainability Test Coverage

KubeSurvival allows you to significantly reduce your Kubernetes compute costs by finding the cheapest machine types that can run your workloads successfully.

If you have a multi-tenant environment, ML training jobs, a large number of ML model servers, etc, this tool can help you optimize your K8s compute costs.

To easily define workloads, KubeSurvival uses a very simple DSL:

  (
    # Some microservice
    pod(cpu: 1, memory: "1Gi") + 

    # Another microservice - with 3 replicas
    pod(cpu: "500m", memory: "2Gi") * 3 +

    # More microservices!
    (
      pod(cpu: 1, memory: "1Gi") +
      pod(cpu: "250m", memory: "1Gi")
    ) * 3
  ) * 2  # Production, Staging

This will give you a result such as:

Instance type: t3.medium
Node count: 11
Total Price per Month: USD $340.45

Made with ❤️ by Aporia

Installation

Download a precompiled binary for your operating system from the Releases page.

Alternatively, if you have Go installed, you can run:

$ go install github.com/aporia-ai/kubesurvival/v2

Usage

To run KubeSurvival:

./kubesurvival config.yaml

See the examples directory for example config files.

How does it work?

KubeSurvival uses k8s-cluster-simulator to simulate Kubernetes pod scheduling, without running on the actual underlying machines. It iterates over all possible instance types and node counts, simulates a K8s cluster with your workload, and checks if there are any pending pods.

For each simulation it calculates the on-demand cost per month using the ec2-instances-info library. Additionally, it queries the eni-max-pods.txt file to determine what's the maximum number of pods in each instance type.

When simulating a cluster, KubeSurvival always makes sure you have 10% free CPU and Memory on each node.

Finally, KubeSurvival selects the cheapest configuration without pending pods.

What's missing from this?

Well... a lot actually. Here's a partial list:

  • Support for AKS and GKE
  • Support for calculating costs of EBS storages
  • Support for different node groups (e.g 2 machines with GPU + 4 machines without GPU)
  • and probably much more!

We would love your help! ❤️