Trillion sort is a demo batch application that sorts a large quantity of numbers.
Given a large quantity of numbers, sort them. This is a classic problem in computer science: anybody who has worked with software in any capacity has come across sorting, and anyone who studied computer science has encountered sorting algorithms. So sorting itself is nothing novel. The twist here is that the quantity of numbers is so large it cannot fit in memory. That constraint is not unique to sorting; many applications face it. This application is meant to be a generic solution to problems of this nature.
- Use a database: This would technically work, but for this problem it would be costly, slow, and overkill.
- Use a message queue: Send the numbers in buckets, sort each bucket individually, and join the results sequentially. This would work, and we would save some time producing the final file since each sorted bucket just needs to be appended in order. But pushing a trillion numbers through a message queue would be slow, the queue would need to handle a trillion messages, and the network overhead alone makes this a less than optimal solution.
- Use a distributed file system: Store the numbers on a shared file system, sort the pieces in parallel, then join the files in a k-way merge. A similar problem is pretty common in interviews; you can learn more by looking up Merge K Sorted Array. This is how we are implementing the solution. It is not perfect, since we still rely on the network for file storage, but that's the price of distributed computing.
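The k-way merge at the heart of this approach can be sketched in Python. `heapq.merge` lazily merges any number of sorted iterators while holding only one element per stream in memory, which is what makes merging files larger than RAM possible (the lists below stand in for sorted file streams; this is an illustration, not the project's actual code):

```python
import heapq

def merge_sorted_streams(*streams):
    """Lazily k-way merge sorted iterators; memory use is O(k), not O(n)."""
    return heapq.merge(*streams)

# Three sorted "files" merged into one sorted sequence.
merged = list(merge_sorted_streams([1, 4, 9], [2, 3, 10], [5, 6, 7]))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 9, 10]
```

In the real application each stream would be a file handle over a sorted chunk, but the merge logic is the same.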
There are two different applications here: one that generates the numbers and one that sorts them.
This part is somewhat trivial. To sort numbers we first need some numbers, so this application generates a large quantity of them.
This has two parts.
- Generate Numbers: This generates a given count of random numbers and writes them to a file.
- Join Files: This joins all the files generated by the first part.
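The two generator steps can be sketched as follows. The function names and one-number-per-line format are assumptions for illustration, not necessarily what the project uses:

```python
import random

def generate_numbers(path, count, seed=None):
    """Generate Numbers step: write `count` random integers to `path`, one per line."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(count):
            f.write(f"{rng.randint(0, 2**31 - 1)}\n")

def join_files(paths, out_path):
    """Join Files step: concatenate the generated files into one large input file."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    out.write(line)
```

Because joining is a plain concatenation, many generator jobs can run in parallel and the join remains a cheap sequential pass.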
This has three parts. Sorting the entire file at once is not possible (see the problem statement), so we need to split the file into multiple files, sort each file, and then merge the files.
- Split File: This splits the file into multiple files.
- Sort File: This sorts the file.
- Merge Files: This merges all the files generated by the second part.
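The three steps above can be sketched in Python, assuming numbers are stored one per line; the actual application's code, chunk sizing, and file layout may differ:

```python
import heapq
import os

def split_file(path, lines_per_chunk, out_dir):
    """Split File step: cut the big input into chunks small enough to sort in memory."""
    chunks = []
    with open(path) as f:
        buf = []
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_chunk:
                chunks.append(_write_chunk(buf, out_dir, len(chunks)))
                buf = []
        if buf:
            chunks.append(_write_chunk(buf, out_dir, len(chunks)))
    return chunks

def _write_chunk(lines, out_dir, idx):
    chunk_path = os.path.join(out_dir, f"chunk-{idx}.txt")
    with open(chunk_path, "w") as f:
        f.writelines(lines)
    return chunk_path

def sort_file(path):
    """Sort File step: each chunk fits in RAM by construction, so sort it in place."""
    with open(path) as f:
        nums = sorted(int(line) for line in f)
    with open(path, "w") as f:
        f.writelines(f"{n}\n" for n in nums)

def merge_files(chunk_paths, out_path):
    """Merge Files step: k-way merge of sorted chunks, streaming line by line."""
    files = [open(p) for p in chunk_paths]
    try:
        with open(out_path, "w") as out:
            for line in heapq.merge(*files, key=int):
                out.write(line)
    finally:
        for f in files:
            f.close()
```

In the distributed version the sort step runs as many parallel jobs over the shared volume, while split and merge are single sequential passes.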
You are not here to learn about sorting. You want to see how to run this application, so here is how you do it.
The example was tested on GKE, but it should work on any Kubernetes cluster; you would have to set up the PVC and storage class for your cluster. The examples below work on GKE directly.
gcloud container clusters create trillion-sort --num-nodes 3 --machine-type n1-standard-4
cd k8s
kubectl apply -f pvc.yaml
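The PVC needs a `ReadWriteMany` access mode so that the parallel sort jobs can all mount the same volume; on GKE that is typically backed by Filestore. A minimal sketch of what such a claim looks like (the name, storage class, and size here are assumptions; use the `pvc.yaml` from the repo):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: trillion-sort-pvc        # assumed name; must match the jobs' volume claims
spec:
  accessModes:
    - ReadWriteMany              # many pods read and write the shared data in parallel
  storageClassName: standard-rwx # Filestore-backed RWX class on GKE
  resources:
    requests:
      storage: 1Ti
```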
Generate the data.
./k8s-job-generate.sh
Sort the data.
./k8s-job-sort.sh
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.5/install.yaml
In a production environment we would want to set up proper authentication, but for this demo we will use insecure mode.
kubectl patch deployment \
argo-server \
--namespace argo \
--type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args", "value": [
"server",
"--auth-mode=server",
"--secure=false"
]},
{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/httpGet/scheme", "value": "HTTP"}
]'
We can then access the argo UI using port forwarding.
kubectl -n argo port-forward deployment/argo-server 2746:2746
You can then access the UI at http://localhost:2746.
Generate the data.
argo submit --watch workflow-generate.yaml
Sort the data.
argo submit --watch workflow-sort.yaml
- Instead of using an NFS file mount and sending data over the network, we could use local SSDs and write to NFS only when necessary.
- We could use a faster Filestore tier such as High Scale SSD, which has write speeds of up to 8.8 GB/s. Right now our speed is about 100 MB/s.
- Right now we are doing sequential sorts, which are CPU bound. We could make use of a GPU and sort in parallel.