Serverless Variant Calling

Datasets

  1. Trypanosome
  2. Human
  3. Bos taurus

*The pipeline currently supports only single-end sequences. For paired-end reads, use only the first sequence.

Setup

  1. Set up Lithops for the AWS backend.

  2. Build the runtime from within the dockerfile directory:

$ lithops runtime build -f lambda.Dockerfile serverless-genomics:1
  1. Configure Lithops to use the built runtime (e.g. serverless-genomics:1). The required runtime memory, ephemeral disk size and timeout will depend on the number of partitions and

  2. Create an S3 bucket with the required input datasets.

  3. Use VariantCallingPipeline to execute the pipeline, providing the necessary parameters. The complete list is in serverlessgenomics/pipeline.py. The required parameters are:

    • run_id: ID of a specific run. It can be reused for failed runs. You must change the ID if different input data are used.
    • fasta_path: Path of the input reference genome (s3://...).
    • fasta_chunks: Number of FASTA partitions.
    • fastq_path: Path of the input sequence reads (s3://...).
    • fastq_chunks: Number of FASTQ partitions.
    • storage_bucket: Temporary data S3 bucket name. It must exist.
  4. Call VariantCallingPipeline.preprocess() to run the pre-processing stage, VariantCallingPipeline.alignment() to run the alignment phase, VariantCallingPipeline.reduce() to run the reduce phase, or VariantCallingPipeline.run_pipeline() for a complete execution.
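
The last two steps can be sketched in Python as follows. This is a minimal, unofficial sketch: it assumes that VariantCallingPipeline is importable from the serverlessgenomics package and that its constructor accepts the required parameters as keyword arguments. Neither is confirmed here, so check serverlessgenomics/pipeline.py first. The helpers build_params and run, and the bucket and object names in the usage comment, are hypothetical and only serve the illustration.

```python
from urllib.parse import urlparse

# Required parameters, as listed in the setup steps above.
REQUIRED = ("run_id", "fasta_path", "fasta_chunks",
            "fastq_path", "fastq_chunks", "storage_bucket")


def build_params(run_id, fasta_path, fasta_chunks,
                 fastq_path, fastq_chunks, storage_bucket):
    """Collect the required parameters after basic sanity checks (hypothetical helper)."""
    # Both inputs are expected to be S3 URIs of the form s3://bucket/key.
    for name, path in (("fasta_path", fasta_path), ("fastq_path", fastq_path)):
        parsed = urlparse(path)
        if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
            raise ValueError(f"{name} must look like s3://bucket/key, got {path!r}")
    # Partition counts must be positive integers.
    for name, chunks in (("fasta_chunks", fasta_chunks), ("fastq_chunks", fastq_chunks)):
        if not isinstance(chunks, int) or chunks < 1:
            raise ValueError(f"{name} must be a positive integer, got {chunks!r}")
    return {
        "run_id": run_id,
        "fasta_path": fasta_path,
        "fasta_chunks": fasta_chunks,
        "fastq_path": fastq_path,
        "fastq_chunks": fastq_chunks,
        "storage_bucket": storage_bucket,
    }


def run(params, stage="all"):
    """Run one stage, or the whole pipeline, with the given parameters."""
    # Imported lazily so build_params works without the package installed.
    # ASSUMPTION: import path and keyword-argument constructor are not verified.
    from serverlessgenomics import VariantCallingPipeline

    pipeline = VariantCallingPipeline(**params)
    stages = {
        "preprocess": pipeline.preprocess,
        "alignment": pipeline.alignment,
        "reduce": pipeline.reduce,
        "all": pipeline.run_pipeline,
    }
    if stage not in stages:
        raise ValueError(f"unknown stage {stage!r}")
    return stages[stage]()


# Example usage (placeholder bucket and object names):
# params = build_params("run-001", "s3://my-bucket/reference.fasta", 4,
#                       "s3://my-bucket/reads.fastq", 8, "my-bucket")
# run(params)                      # complete execution
# run(params, stage="preprocess")  # a single stage
```

Validating the parameters before constructing the pipeline surfaces malformed paths or partition counts locally, before any cloud functions are invoked.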

Article

You can read more about this pipeline in the published article Scaling a Variant Calling Genomics Pipeline with FaaS, presented in WoSC '23: Proceedings of the 9th International Workshop on Serverless Computing, part of MIDDLEWARE 2023 24th ACM/IFIP International Middleware Conference: https://dl.acm.org/doi/10.1145/3631295.3631403 (Preprint)