Trillion sort is a demo batch application that sorts a large quantity of numbers.
Given a large quantity of numbers, sort them. This is a classic problem in computer science: anybody who has worked with software in any capacity has come across sorting, and anyone who studied computer science has encountered sorting algorithms. So sorting itself is nothing novel. The twist here is that the quantity of numbers is so large it cannot fit in memory. That constraint is not unique to sorting; many applications face it. This application is meant to be a generic solution to problems of this nature.
- Use a database: This would technically work, but for this problem it would be costly, slow, and overkill.
- Use a message queue: Send the numbers in buckets, sort each bucket individually, and join the results sequentially. This would work, and we would save some time producing the final file since each sorted bucket just needs to be appended in order. But pushing a trillion numbers through a message queue would be slow, the queue would need to handle a trillion messages, and the network overhead alone makes this a less than optimal solution.
- Use a distributed file system: Store the numbers on a shared file system, sort the pieces in parallel, then join the files in a k-way merge. A similar problem is pretty common in interviews; you can learn more by looking up Merge K Sorted Array. This is how we are implementing the solution. It is not perfect, since we still rely on the network for file storage, but that's the price of distributed computing.
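The k-way merge at the heart of this approach can be sketched in Python. `heapq.merge` lazily merges any number of sorted iterators while holding only one element per stream in memory, which is what makes merging files larger than RAM possible (the lists below stand in for sorted file streams; this is an illustration, not the project's actual code):

```python
import heapq

def merge_sorted_streams(*streams):
    """Lazily k-way merge sorted iterators; memory use is O(k), not O(n)."""
    return heapq.merge(*streams)

# Three sorted "files" merged into one sorted sequence.
merged = list(merge_sorted_streams([1, 4, 9], [2, 3, 10], [5, 6, 7]))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 9, 10]
```

In the real application each stream would be a file handle over a sorted chunk, but the merge logic is the same.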
There are two different applications here: one that generates the numbers and one that sorts them.
This part is somewhat trivial. To sort numbers we first need some numbers, so this application generates a large quantity of them.
This has two parts.
- Generate Numbers: This generates a given count of random numbers and writes them to a file.
- Join Files: This joins all the files generated by the first part.
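The two generator steps can be sketched as follows. The function names and one-number-per-line format are assumptions for illustration, not necessarily what the project uses:

```python
import random

def generate_numbers(path, count, seed=None):
    """Generate Numbers step: write `count` random integers to `path`, one per line."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(count):
            f.write(f"{rng.randint(0, 2**31 - 1)}\n")

def join_files(paths, out_path):
    """Join Files step: concatenate the generated files into one large input file."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    out.write(line)
```

Because joining is a plain concatenation, many generator jobs can run in parallel and the join remains a cheap sequential pass.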
This has three parts. Sorting the entire file at once is not possible (see the problem statement), so we need to split the file into multiple files, sort each file, and then merge the files.
- Split File: This splits the file into multiple files.
- Sort File: This sorts the file.
- Merge Files: This merges all the files generated by the second part.
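The three steps above can be sketched in Python, assuming numbers are stored one per line; the actual application's code, chunk sizing, and file layout may differ:

```python
import heapq
import os

def split_file(path, lines_per_chunk, out_dir):
    """Split File step: cut the big input into chunks small enough to sort in memory."""
    chunks = []
    with open(path) as f:
        buf = []
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_chunk:
                chunks.append(_write_chunk(buf, out_dir, len(chunks)))
                buf = []
        if buf:
            chunks.append(_write_chunk(buf, out_dir, len(chunks)))
    return chunks

def _write_chunk(lines, out_dir, idx):
    chunk_path = os.path.join(out_dir, f"chunk-{idx}.txt")
    with open(chunk_path, "w") as f:
        f.writelines(lines)
    return chunk_path

def sort_file(path):
    """Sort File step: each chunk fits in RAM by construction, so sort it in place."""
    with open(path) as f:
        nums = sorted(int(line) for line in f)
    with open(path, "w") as f:
        f.writelines(f"{n}\n" for n in nums)

def merge_files(chunk_paths, out_path):
    """Merge Files step: k-way merge of sorted chunks, streaming line by line."""
    files = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as out:
            for line in heapq.merge(*files, key=int):
                out.write(line)
    finally:
        for f in files:
            f.close()
```

In the distributed version the sort step runs as many parallel jobs over the shared volume, while split and merge are single sequential passes.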
You are not here to learn about sorting. You want to see how to run this application, so here is how you do it.
The example was tested on GKE, but it should work on any Kubernetes cluster; you would have to set up the PVC and storage class for your cluster. The examples below work on GKE directly.
gcloud container clusters create trillion-sort --num-nodes 3 --machine-type n1-standard-4
cd k8s
kubectl apply -f pvc.yaml
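The PVC needs a `ReadWriteMany` access mode so that the parallel sort jobs can all mount the same volume; on GKE that is typically backed by Filestore. A minimal sketch of what such a claim looks like (the name, storage class, and size here are assumptions; use the `pvc.yaml` from the repo):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trillion-sort-pvc        # assumed name; must match the jobs' volume claims
spec:
  accessModes:
    - ReadWriteMany              # many pods read and write the shared data in parallel
  storageClassName: standard-rwx # Filestore-backed RWX class on GKE
  resources:
    requests:
      storage: 1Ti
```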
Generate the data.
./k8s-job-generate.sh
Sort the data.
./k8s-job-sort.sh
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.5/install.yaml
In a production environment we would want to set up proper authentication, but for this demo we will use insecure mode.
kubectl patch deployment \
argo-server \
--namespace argo \
--type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": [
"server",
"--auth-mode=server",
"--secure=false"
]},
{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/httpGet/scheme", "value": "HTTP"}
]'
We can then access the argo UI using port forwarding.
kubectl -n argo port-forward deployment/argo-server 2746:2746
You can then access the UI at http://localhost:2746.
Generate the data.
argo submit --watch workflow-generate.yaml
Sort the data.
argo submit --watch workflow-sort.yaml
- Instead of using an NFS file mount and sending data over the network, we could use local SSDs and write to NFS only when necessary.
- We could use a faster Filestore tier such as High Scale SSD, which has write speeds of up to 8.8 GB/s. Right now our speed is about 100 MB/s.
- Right now we are doing sequential sorts, which are CPU bound. We could make use of a GPU and sort in parallel.