Genomics pipelines. At scale. With Spark and Glow. 🤯
Spark-based pipelines for:
- Variant calling (built on GATK's HaplotypeCaller)
- Somatic variant calling (built on MuTect2)
- Joint genotyping (built on GenotypeGVCFs)
To build and test:
- Clone the repo.
- Unpack the big test files archive located in the project root: `tar -xf big-files.tar.gz`
- Run the tests: `sbt test`
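
The unpack and test steps above can be wrapped in a small shell helper. This is a minimal sketch, not part of the repository: the function names `unpack_test_files` and `build_and_test` are hypothetical, and it assumes `tar` and `sbt` are on the PATH and that it is run from the project root.

```shell
# Hypothetical helper functions, not part of the repository.

# Extract the test archive into the current directory,
# failing early with a message if the archive is missing.
unpack_test_files() {
  archive="$1"
  if [ ! -f "$archive" ]; then
    echo "Archive not found: $archive (run this from the project root)" >&2
    return 1
  fi
  tar -xf "$archive"
}

# Unpack the test data, then run the test suite only if unpacking succeeded.
build_and_test() {
  unpack_test_files "big-files.tar.gz" && sbt test
}
```

Calling `build_and_test` from the project root would run both steps in order and stop before `sbt test` if the archive is missing.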
To run a pipeline on a cluster:
- Create an init script to download the reference genome from cloud storage (see `hls.sh` or `prepare_reference.py` for inspiration).
- Build an uber jar (`sbt assembly`).
- Create a cluster with the init script from the first step and attach the assembly jar.
- Run the desired pipeline using one of the attached notebooks.
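
For the first step above, an init script might look roughly like the following. This is a sketch under stated assumptions, not a copy of `hls.sh`, which may work differently: the `download_reference` function and the `REFERENCE_URL` and `REFERENCE_DIR` environment variables are hypothetical, the default target directory is a placeholder, and the reference is assumed to be reachable over HTTPS with `curl`. For a bucket that requires authentication, the `curl` call would likely be replaced with the storage provider's own CLI.

```shell
#!/bin/sh
# Hypothetical init script sketch; hls.sh in the repository may work differently.
set -eu

# Download a file into dest_dir unless a non-empty copy is already there.
# Prints the local path of the file.
download_reference() {
  url="$1"
  dest_dir="$2"
  dest="${dest_dir}/$(basename "$url")"
  mkdir -p "$dest_dir"
  if [ ! -s "$dest" ]; then
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file.
    curl -fsSL --retry 3 -o "${dest}.part" "$url"
    mv "${dest}.part" "$dest"
  fi
  echo "$dest"
}

# Assumed environment variables; the script does nothing unless REFERENCE_URL is set.
if [ -n "${REFERENCE_URL:-}" ]; then
  download_reference "$REFERENCE_URL" "${REFERENCE_DIR:-/tmp/reference}"
fi
```

Skipping the download when the file already exists keeps the script safe to run more than once on the same node, for example after a restart.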
This is not an official Databricks product. This project is released without an expectation of continued development or maintenance.