Welcome to AI Peaks

From 0 to Pipeline using Kubernetes
Author: Pete George
Special Guest: Tom George

Whattaya cryin' about?

When I was a child, my father would collect every receipt he had in his wallet so he could reconcile them against his bank statement the following month.

[image: a bulging wallet]

What if he had a tool that would help him?
What if that tool could extract the text in those receipts for him?

Let's help Pete's dad!

We are going to build that tool today!

Step 1: Using Donut locally

Donut is a model that enables VDU (Visual Document Understanding). It pairs an image transformer encoder with an autoregressive text transformer decoder, which lets it "understand" documents end to end without OCR (Optical Character Recognition) and without the performance hits and error-prone post-processing that OCR pipelines bring.

Donut can be installed from PyPI using pip install donut-python.

Run test-metaflow.py on your computer to view the extracted text from receipt_test.jpg.
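
To make the local test concrete, here is a minimal sketch of what a script like test-metaflow.py could do, assuming the donut-python API and the publicly available receipt checkpoint naver-clova-ix/donut-base-finetuned-cord-v2; your copy of the script and its checkpoint may differ.

from donut import DonutModel
from PIL import Image
import torch

# Load a Donut checkpoint fine-tuned on receipts (CORD); assumed here for illustration.
model = DonutModel.from_pretrained('naver-clova-ix/donut-base-finetuned-cord-v2')
if torch.cuda.is_available():
    model.half().to('cuda')
model.eval()

# Run inference on the test receipt; Donut produces structured output directly, no OCR step.
image = Image.open('receipt_test.jpg').convert('RGB')
with torch.no_grad():
    output = model.inference(image=image, prompt='<s_cord-v2>')
print(output['predictions'][0])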

Step 2: Containerization

Install Docker Desktop

Docker Desktop can be found here

Install kubectl, the command-line tool for working with Kubernetes clusters

Docker Desktop has its own version of kubectl, but follow these instructions if you need to install it manually in the future.

For the latest version of kubectl at the time of this workshop, run

curl.exe -LO "https://dl.k8s.io/release/v1.28.2/bin/windows/amd64/kubectl.exe"

Verify the checksum and test the installation, as described here.

Install Minikube to run and manage a local Kubernetes cluster

Using Windows PowerShell as Administrator, create the install directory and download the binary

New-Item -Path 'c:\' -Name 'minikube' -ItemType Directory -Force
Invoke-WebRequest -OutFile 'c:\minikube\minikube.exe' -Uri 'https://github.com/kubernetes/minikube/releases/latest/download/minikube-windows-amd64.exe' -UseBasicParsing

And don't forget to add the location of the minikube.exe binary to your PATH

Build the docker image

cd to the local repository location (where the Dockerfile is present) and build the docker image. The Dockerfile pulls the latest version of Ubuntu from Docker Hub, installs Python 3.12, installs Donut, and configures the environment so everything runs seamlessly. We're going to call this image aipeaks for now.

docker build -t aipeaks .

Let's test our image locally using Metaflow

Execute the docker image in a container

docker run -it aipeaks

In the container's terminal, set environment variables so we can connect to our S3 bucket, which contains our test images

export AWS_ACCESS_KEY_ID=my_aws_access_key_id
export AWS_SECRET_ACCESS_KEY=my_aws_secret_access_key
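
Before running the flow, you can sanity-check those credentials from inside the container with a quick boto3 call (boto3 is installed alongside Metaflow for AWS access; the bucket name and prefix below are placeholders for your own):

import boto3

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment.
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='my-aipeaks-bucket', Prefix='test_images/', MaxKeys=5)
for obj in resp.get('Contents', []):
    print(obj['Key'])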

Now run flow.py, which will take 2 images from our S3 bucket and run Donut on them locally.

python3 flow.py run
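
For orientation, here is a rough sketch of the shape a flow.py like this could take: a Metaflow flow that lists the images with Metaflow's S3 client, fans out over them, and runs Donut on each. The bucket name, class name, and step names are placeholders, and the line numbers mentioned in Step 3 refer to the real flow.py in the repository, not to this sketch.

from metaflow import FlowSpec, step, S3


class AIPeaksFlow(FlowSpec):

    @step
    def start(self):
        # List the test images in the bucket (bucket name is a placeholder).
        with S3(s3root='s3://my-aipeaks-bucket/') as s3:
            urls = [obj.url for obj in s3.list_paths(['test_images'])]
        self.urls = urls
        self.next(self.extract, foreach='urls')

    @step
    def extract(self):
        # Each branch downloads one image and runs Donut on it (see the local test above).
        from donut import DonutModel
        from PIL import Image
        model = DonutModel.from_pretrained('naver-clova-ix/donut-base-finetuned-cord-v2')
        model.eval()
        with S3() as s3:
            # The downloaded temp file is only valid inside this block,
            # so load and convert the image before leaving it.
            image = Image.open(s3.get(self.input).path).convert('RGB')
        self.prediction = model.inference(image=image, prompt='<s_cord-v2>')['predictions'][0]
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect the per-image predictions from the parallel branches.
        self.predictions = [i.prediction for i in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.predictions)


if __name__ == '__main__':
    AIPeaksFlow()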

Configuring Minikube

In a terminal, start Minikube

minikube start

For cluster monitoring, open a Minikube dashboard in a separate terminal. Keep that terminal session running for as long as you need monitoring; closing it will kill the dashboard.

minikube dashboard

Minikube comes with its own Docker daemon. In the first terminal, execute

@FOR /f "tokens=*" %i IN ('minikube -p minikube docker-env --shell cmd') DO @%i

This points docker commands run in this terminal at the minikube cluster's Docker daemon instead of Docker Desktop's.

View the images available in the minikube cluster with docker image ls. Notice that our aipeaks image is not there. We can either push our aipeaks image into the minikube cluster, or build it directly inside the cluster. For simplicity, let's build it directly in the minikube cluster with docker build -t aipeaks . (run from the repository directory again). Sometimes the docker CLI is pinned to a named context that conflicts with the minikube environment variables we just set; if the build fails or targets the wrong daemon, switch back with docker context use default and try again.

If you want to explore the former option, check this for more information.

We can now create jobs and deployments using kubectl. They will show up in the dashboard.

As an example, run kubectl create job aipeaks-job --image=aipeaks to create a job. Remember, this image is not optimized yet, so your computer (and cluster) will need a lot of available memory to run it.

Step 3: Looking to the clouds

Metaflow is a great data science workflow tool that automatically scales compute as workflows are created and executed.

Outerbounds, the maintainer of Metaflow, publishes templates to deploy your stack in Azure, GCP, and AWS. They publish a variety of Terraform files, Helm charts, and YAML configuration files to make standing up cloud infrastructure stacks easy. As an example, check out metaflow-cloudformation-setup.yaml to deploy a basic stack in AWS. This template is just one of many available in the public Metaflow GitHub repository. Setup for the stack took about 10 minutes, which is pretty speedy for a basic sandbox.

In SageMaker, comment out line 11 in flow.py

#urls = [obj.url for obj in s3.list_paths(['test_images'])]

and uncomment lines 12 and 17

urls = [obj.url for obj in s3.list_paths(['lotsa_images'])]

@batch(queue='job-queue-aipeaks' ...

and save it using Ctrl + S on your keyboard.

We're going to pull from the lotsa_images folder in our S3 bucket, which has 99 receipt images instead of the couple of test images we used earlier. Metaflow will manage the cluster and scale the compute pods up and down depending on the load.

You could just as easily uncomment line 18 to run flow.py in EKS (Amazon Elastic Kubernetes Service).
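
To make the decorator toggling concrete, here is a stripped-down sketch (not the real flow.py) of how a single step is pointed at AWS Batch, or at Kubernetes/EKS, just by swapping which decorator is left uncommented. @batch and @kubernetes are real Metaflow decorators; the queue name matches the one above and everything else here is a placeholder.

from metaflow import FlowSpec, step, batch


class ScaleDemoFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.extract)

    # @batch sends this step to AWS Batch; swap in @kubernetes
    # (from metaflow import kubernetes) to target EKS instead.
    @batch(queue='job-queue-aipeaks')
    # @kubernetes
    @step
    def extract(self):
        print('the heavy Donut work would run remotely here')
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    ScaleDemoFlow()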

And kick it off with python flow.py run