- A Unix-based operating system (Mac/Linux)
- A Java 8 runtime environment (you can download Java here)
- Google Cloud SDK (you can download the SDK here)
- A MySQL database (create database cromwell)
- On your Google project, open the API Manager and enable the following APIs:
  - Google Compute Engine
  - Google Cloud Storage
  - Google Genomics API
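Before going further, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of the original repository; the function name `missing_tools` is hypothetical, and the tool list in the usage note is an assumption drawn from the commands used in this guide.

```shell
# Print the name of every listed command that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, `missing_tools java gcloud mysql wget git zip curl` lists whichever of those tools still need to be installed.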
This code was tested on Cromwell v35.
# wget https://github.com/broadinstitute/cromwell/releases/download/36/cromwell-36.jar
# gcloud init
# gcloud auth login <login-id>
# gcloud auth application-default login
# gcloud config set project <google-cloud-project-id>
# git clone https://github.com/hongiiv/gatk-workflows.git
Copy the Cromwell JAR into the cloned repository and change into it:
# cp cromwell-36.jar gatk-workflows/
# cd gatk-workflows
This workflow calls other workflows; a workflow called this way is referred to as a sub-workflow. It is based on the following official GATK workflows:
- seq-format-conversion: Workflows for converting between sequence data formats
- gatk4-data-processing: Workflows for processing high-throughput sequencing data for variant discovery with GATK4 and related tools
- gatk4-somatic-snvs-indels: Workflows for somatic short variant discovery with GATK4
# zip -r workflowDependencies.zip ./gatk4-data-processing/ ./gatk4-somatic-snvs-indels/ ./seq-format-conversion/
- MySQL database credentials (user = your MySQL user ID, password = your MySQL password)
- Google project (project = your Google project name)
- Base bucket (root = your Google Cloud Storage bucket name)
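After editing the configuration, a quick way to review the four settings listed above is to print the lines that assign them. This is a minimal sketch under the assumption that the configuration file uses plain `key = value` lines; the exact layout of `google.conf` is not shown in this guide, and the function name `show_settings` is hypothetical. Note that it prints the password line as well.

```shell
# Print, with line numbers, every line of the given config file that
# assigns user, password, project, or root.
show_settings() {
    grep -n -E '^[[:space:]]*(user|password|project|root)[[:space:]]*=' "$1"
}
```

Running `show_settings google.conf` should show one line per setting; a missing line suggests a setting that still needs to be filled in.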
# java -Dconfig.file=google.conf -jar cromwell-36.jar run hello.wdl -o generic.google-papi.options.json
# java -Dconfig.file=google.conf -jar cromwell-36.jar \
    submit FullPipeline.wdl \
    -I FullPipeline.input.json \
    -o generic.google-papi.options.json \
    -h http://localhost:8080 \
    -p workflowDependencies.zip
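The submit command sends the workflow to a Cromwell server that is already listening at the address given with -h, so a server has to be started first (Cromwell's `server` sub-command does this). The sketch below only assembles the command line so it can be inspected before running; the helper name `cromwell_cmd` and the two environment variables are hypothetical, and the default JAR name assumes the file downloaded earlier in this guide.

```shell
# Assumed defaults; override them in the environment if your files differ.
CROMWELL_JAR="${CROMWELL_JAR:-cromwell-36.jar}"
CROMWELL_CONF="${CROMWELL_CONF:-google.conf}"

# Print the java command line for a Cromwell sub-command
# (server, run, or submit) followed by any extra arguments.
cromwell_cmd() {
    printf 'java -Dconfig.file=%s -jar %s' "$CROMWELL_CONF" "$CROMWELL_JAR"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}
```

`cromwell_cmd server` prints the command that starts the server; run that command in a separate terminal and leave it running while you submit workflows.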
# java -Dconfig.file=google.conf -jar cromwell-36.jar \
    run FullPipeline.wdl \
    -I FullPipeline.input.json \
    -o generic.google-papi.options.json
# curl -X POST "http://localhost:8080/api/workflows/v1" \
-H "accept: application/json" -H "Content-Type: multipart/form-data" \
-F "workflowSource=@FullPipeline.wdl;type=" \
-F "workflowInputs=@FullPipeline.input.json;type=application/json" \
-F "workflowDependencies=@workflowDependencies.zip;type=application/zip"
You can access the Swagger web page at http://localhost:8080.
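Once a workflow has been submitted through the REST API, its progress can be checked at the status endpoint for its workflow ID. The following is a minimal sketch; it assumes the submit response is a JSON object with an "id" field (for example {"id":"<uuid>","status":"Submitted"}), and the helper names `extract_workflow_id` and `status_url` are hypothetical.

```shell
# Assumed server address; matches the host used in this guide.
CROMWELL_HOST="${CROMWELL_HOST:-http://localhost:8080}"

# Extract the value of the "id" field from a submit response
# (assumed shape: {"id":"<uuid>","status":"Submitted"}).
extract_workflow_id() {
    printf '%s\n' "$1" |
        sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Build the status URL for a workflow ID.
status_url() {
    printf '%s/api/workflows/v1/%s/status\n' "$CROMWELL_HOST" "$1"
}
```

With the curl response saved in a variable such as `resp`, `id=$(extract_workflow_id "$resp")` followed by `curl -s "$(status_url "$id")"` asks the server for the current status of that workflow.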
Q: There are several ways to run a workflow. Which one should I use?
A: We recommend running it in server mode (through the REST API or the submit command), because the server lets you manage the workflow after it has been submitted.
Q: Why do the GATK workflows use uBAM instead of FASTQ files as the default input?
A: The recently published "Standards and Guidelines for Validating Next-Generation Sequencing Bioinformatics Pipelines" recommends that laboratory, run, and patient identifiers be recorded in the file's metadata, and that these identifiers also appear in the file name itself. uBAM is well suited to meeting these requirements.
Q: Are there commercial services that support WDL?