Experimental WDL workflows for running GATK4 on Spark.
We use Cromwell with the local backend to run the workflow: Cromwell launches the GATK commands locally, and the GATK Spark tools submit their work to the cluster. You therefore need a cluster with Spark installed, e.g. CDH 5.7.0.
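Before starting, it is worth checking that the Spark and Hadoop client tools are on your path. This is only a sanity check and assumes a typical CDH-style installation; the versions printed will vary.

```shell
# Print Spark version information (spark-submit writes this to stderr).
spark-submit --version

# Print the Hadoop version; if either command is missing,
# fix the cluster client setup before continuing.
hadoop version
```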
First, clone this repository:
git clone https://github.com/tomwhite/gatk-spark-workflows
cd gatk-spark-workflows
Copy a small BAM file for testing into HDFS:
hadoop fs -put data/CEUTrio.HiSeq.WGS.b37.ch20.1m-1m1k.NA12878.bam \
bam/CEUTrio.HiSeq.WGS.b37.ch20.1m-1m1k.NA12878.bam
Edit MarkDuplicatesInputs.json so that the paths it references match your environment.
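For reference, a Cromwell inputs file is a JSON object whose keys are the workflow's inputs, fully qualified as <workflow>.<input>. The sketch below is illustrative only; the actual workflow and input names are defined in MarkDuplicates.wdl, so check that file for the real keys.

```json
{
  "MarkDuplicates.input_bam": "bam/CEUTrio.HiSeq.WGS.b37.ch20.1m-1m1k.NA12878.bam",
  "MarkDuplicates.output_path": "out/markdups"
}
```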
Now download the Cromwell JAR:
wget https://github.com/broadinstitute/cromwell/releases/download/0.19.3/cromwell-0.19.jar
Finally, run the workflow:
java -jar cromwell-0.19.jar run MarkDuplicates.wdl MarkDuplicatesInputs.json
Once the workflow completes, the output should be visible in HDFS:
hadoop fs -ls out/markdups
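To inspect the result locally, you can copy it out of HDFS. This is a sketch; the exact file names under out/markdups depend on how the Spark tool writes its output (it may be a directory of part files rather than a single BAM).

```shell
# Copy the whole output directory from HDFS to the local filesystem.
hadoop fs -get out/markdups ./markdups

# See what was produced.
ls -l ./markdups
```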