
Running snakemake using Azure kubernetes service

This walkthrough works as of 2023-01-31. Unfortunately, future changes to either Azure Kubernetes Service or snakemake may prevent it from working. However, it will serve as a record of my debugging efforts to help any poor soul that attempts to use Azure Kubernetes Service. Good luck and may Bill Gates have mercy on you.

Contents

Main changes from official snakemake tutorial

References

  1. Set up linux environment to launch kubernetes

  2. Create a conda environment with needed dependencies

  3. Create a storage account

  4. Get example data from this repo

  5. Create auto-scaling kubernetes cluster

  6. Optionally build our own docker image of snakemake

  7. Run the workflow

Main changes from official snakemake tutorial

A snakemake tutorial repo can be found here, but it doesn't work for AKS out of the box.

I modified this repo in the following ways:

  • Copied this workflow to a Snakefile

  • Added a scripts/plot-quals.py script from here

  • Removed data/samples/C.fastq because it was unnecessary for a minimal snakemake example and was not referenced in the Snakefile

  • Added conda: "environment.yaml" to each snakemake rule in the Snakefile so that kubernetes would install the needed software on the fly

  • Added the genome BWA index to the input of the bwa_map rule. bwa mem needs the index in order to run, but kubernetes won't download the index to its nodes unless it is defined in the rule's input (see the rule sketch after this list)

  • Created a custom docker image with snakemake, kubernetes, and azure-storage-blob installed, which kubernetes can use
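
For illustration, here is roughly what the modified bwa_map rule looks like after those changes. This is only a sketch based on the rule from the standard snakemake tutorial; the exact input names in this repo's Snakefile may differ.

rule bwa_map:
    input:
        genome="data/genome.fa",
        # the BWA index files must be listed as inputs, otherwise kubernetes won't stage them onto the node
        index=multiext("data/genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa"),
        fastq="data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    conda:
        "environment.yaml"  # kubernetes installs bwa and samtools on the fly from this environment
    shell:
        "bwa mem {input.genome} {input.fastq} | samtools view -Sb - > {output}"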

The main lessons I learned are:

  • A kubernetes job will only download files to the compute nodes if they are defined as inputs or outputs of the snakemake rule

  • The snakemake image that kubernetes uses needs the azure-storage-blob dependency in order to work

References

The material in this walkthrough was compiled from the following links:

azure-specific snakemake+kubernetes tutorial

generic snakemake+kubernetes tutorial

install azure-cli

Dockerfile for latest snakemake image

How to replicate conda environment in a Dockerfile

Video of how to build image and push it to Docker hub

And a few github issues.

Set up linux environment to launch kubernetes

First we need a computer from which to launch kubernetes. I'm running windows locally and the tutorials above suggest I need the azure-cli installed. I'm guessing my windows subsystem for linux would cause problems for this, so I'm thinking I should just get an azure data science virtual machine (which includes azure-cli) in the cloud to save myself the hassle.

create linux vm in azure portal

Find the data science virtual machine under All services > Create a resource > Marketplace, then use the following specifications (or see the CLI sketch after this list):

resource group: ccf22_robe1195 (or whatever you want)

virtual machine name: ccf22robe1195snakemaketest1 (or whatever you want)

image: data science virtual machine ubuntu 20.04

size: b2s, 2 cpus, 4 GB

username: robe1195 (or whatever you want)

password: **************************************** (40 characters, randomly generated)

os disk type: standard ssd

tag name: mainproject

value: test
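
I created this VM through the portal. If you prefer the CLI, something roughly like the sketch below should be equivalent, though I have not tested it and the image URN is a guess (list the available DSVM images with az vm image list --publisher microsoft-dsvm --all --output table):

# untested CLI sketch of the same VM; check the image URN before running
az vm create \
  --resource-group ccf22_robe1195 \
  --name ccf22robe1195snakemaketest1 \
  --image microsoft-dsvm:ubuntu-2004:2004-gen2:latest \
  --size Standard_B2s \
  --admin-username robe1195 \
  --admin-password '<long random password>' \
  --storage-sku StandardSSD_LRS \
  --tags mainproject=test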

log into vm from local terminal (for me: windows subsystem for linux)

Get IP address by clicking on the VM resource

Log on and enter your password with ssh robe1195@<insert IP address>. You may need to wait a bit before the connection is accepted

Check if conda is installed with conda --help

Check if azure-cli is installed with az --help

Check if docker is installed with docker --help
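
Concretely, the login and sanity checks look something like this (the username and IP address are placeholders):

# log into the VM using the public IP shown in the portal
ssh robe1195@<insert IP address>

# confirm the tools we need are already installed on the DSVM
conda --help
az --help
docker --help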

Create a conda environment with needed dependencies

Once you log into your VM, you need to get the snakemake, azure-storage-blob, and kubernetes packages all in the same environment.

First create the conda environment and install snakemake:

conda create -c bioconda -c conda-forge -n snakemake snakemake

Now install other dependencies into the snakemake environment

# activate snakemake environment
conda activate snakemake

# install kubernetes
# For some reason, installing via both pip and conda prevents later issues
conda install -c conda-forge kubernetes
pip install kubernetes

# install software for interacting with blob storage
conda install -c conda-forge azure-storage-blob

Again, Azure CLI should already be installed because I'm using the DSVM, so that saves one huge headache

Create a storage account

We want snakemake and kubernetes to write to blob storage, so we first need to create a storage account.

You could also do this manually in the portal, or use the following commands:

# change the following names as required
# azure region where to run:
region=northcentralus
# name of the resource group to create (just use the same one assigned to me during the fellowship)
resgroup=ccf22_robe1195
# name of storage account to create (all lowercase, no hyphens etc.):
stgacct=ccf22robe1195snakekub

# create a resource group with name and in region as defined above
# I don't need to do this because I already have a resource group under the cloud fellowship subscription
# az group create --name $resgroup --location $region

# log in to azure: run the command below, then follow the prompts
az login --use-device-code

# create a general purpose storage account with cheapest SKU
az storage account create -n $stgacct -g $resgroup --sku Standard_LRS -l $region

# get storage account key for later use
# the below line is in the tutorial, but doesn't work and the tutorial contains a typo ($storageacct -> $stgacct)
# stgkey=$(az storage account keys list -g $resgroup -n $stgacct | head -n1 | cut -f 3)
# Instead we'll use json file processing and remove quotes to get the key
stgkey=$(az storage account keys list -g $resgroup -n $stgacct | jq .[1].value | sed 's/^"//' | sed 's/"$//')

# finally, create the storage container
az storage container create --resource-group $resgroup --account-name $stgacct --account-key $stgkey --name snakemake-tutorial

Get example data from this repo

Download this repo to your VM, then upload the data/ portion to your storage account so that kubernetes can access it

mkdir tutorial
cd tutorial
git clone https://github.com/milesroberts-123/snakemake-aks-tutorial.git
cd snakemake-aks-tutorial
az storage blob upload-batch -d snakemake-tutorial --account-name $stgacct --account-key $stgkey -s data/ --destination-path data

Create auto-scaling kubernetes cluster

# change the cluster name as you like
# needed to add --generate-ssh-keys to tutorial command
clustername=snakemaks-aks
az aks create --generate-ssh-keys --resource-group $resgroup --name $clustername --vm-set-type VirtualMachineScaleSets --load-balancer-sku standard --enable-cluster-autoscaler --node-count 1 --min-count 1 --max-count 3 --node-vm-size Standard_B4ms

Now fetch the credentials for the cluster so that you can interact with it

# get credentials
az aks get-credentials --resource-group $resgroup --name $clustername

# print basic cluster info
kubectl cluster-info

Optionally build our own docker image of snakemake

I already did this step and pushed the result to dockerhub, so you can technically skip it

An important lesson I learned is that the snakemake version on your launch VM needs to match the snakemake version in the container that kubernetes uses. However, a problem with the current snakemake docker image is that it excludes the azure-storage-blob dependency. Thus, I built a new snakemake image that has this dependency installed.
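
A quick sanity check for the version match (not part of the original tutorial) is to compare the launch VM against the container image, for example the one I pushed to dockerhub:

# snakemake version on the launch VM
snakemake --version

# snakemake version baked into the container image
docker run --rm milesroberts/snakemake-aks:latest snakemake --version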

I started by exporting my current snakemake environment, which already had the needed dependencies (snakemake, azure-storage-blob, and kubernetes)

cd build_snakemake_image
conda env export > snakemake-docker-environment.yaml

Then I wrote a Dockerfile that copies in this environment file and installs it; it looks like this:

# Start from ubuntu base image, make sure its updated so packages work, then install basic necessities
FROM ubuntu:20.04

RUN apt-get update --no-install-recommends --assume-yes && apt-get upgrade --no-install-recommends --assume-yes

RUN apt-get install --no-install-recommends --assume-yes wget curl bzip2 ca-certificates gnupg2 squashfs-tools git

# Add opt/conda to environment path
ENV PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Install mamba to speed up conda
# I needed to remove the strict channel priorities from the system config to get all of the packages installed. I'm not sure if this will cause future problems
RUN curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh > mambaforge.sh
RUN bash mambaforge.sh -b -p /opt/conda
RUN conda config --system --set channel_priority flexible
RUN rm mambaforge.sh

# copy conda environment into docker container
ADD ./snakemake-docker-environment.yaml .
RUN mamba env update --file ./snakemake-docker-environment.yaml && conda clean -tipy

# activate snakemake environment upon starting container, so that installed software is accessible to kubernetes
RUN echo "source activate snakemake" > ~/.bashrc
ENV PATH=/opt/conda/envs/snakemake/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

Now build the docker image

docker build -t ccf22robe1195snakemake:latest .

Now I needed to push the image to dockerhub after logging into docker

First I needed to make the snakemake-aks repo on docker hub, then execute:

docker tag ccf22robe1195snakemake milesroberts/snakemake-aks
docker logout
docker login
docker push milesroberts/snakemake-aks

Finally return to the directory with the Snakefile with cd ..

Run the workflow

Export storage account and key information, then call snakemake with --kubernetes using our custom snakemake image

export AZ_BLOB_ACCOUNT_URL="https://${stgacct}.blob.core.windows.net"

export AZ_BLOB_CREDENTIAL="$stgkey"

snakemake --kubernetes --container-image docker.io/milesroberts/snakemake-aks:latest --default-remote-prefix snakemake-tutorial --default-remote-provider AzBlob --envvars AZ_BLOB_ACCOUNT_URL AZ_BLOB_CREDENTIAL --use-conda --jobs 3
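
While the workflow runs, you can optionally keep an eye on the cluster with standard kubectl commands (these are not part of the tutorial command above):

# watch the pods that snakemake creates for each job
kubectl get pods --watch

# check how many nodes the autoscaler has spun up
kubectl get nodes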