NPM-sample-qc is a Nextflow workflow to obtain QC metrics from single-sample WGS results. It has been created to support QC efforts within the National Precision Medicine programme in Singapore (NPM), a local initiative that intends to sequence the genomes of 100K individuals (SG100K), but can be easily extended to other large-scale sequencing projects.
Here an example command to launch the workflow:
nextflow run main.nf \ -config conf/nextflow.config \ -params-file tests/sample_params.yml \ -work-dir ./work \ --outdir ./ \ --keep_workdir
Please refer to the workflow help for more information on its usage and access to additional options:
nextflow run main.nf --help
NPM-sample-qc input requirements can be split into two categories:
- Generic workflow settings specify parameters that will not vary from run to run, e.g. Nextflow profile declarations, trace/timeline/dag options, output structure and paths to data resources. See
conf/nextflow.config
for additional details. - Sample-specific settings contain paths to WGS results for a given sample, namely CRAM and VCF/gVCFs. Optionally, you can also provide a positive sample tracking VCF (
pst_vcf
) to calculate genotype concordance against your WGS VCF (see the Metric definitions section). Seetests/sample_params.yml
for an example.
NPM users will only be required to update sample_params.yml
to match their sample of interest. In addition, any user can update the global settings by following standard Nextflow configuration guidelines.
Upon completion, NPM-sample-qc will create the following files in the outdir
directory:
/path/to/outdir/ pipeline_info/ # dag, timeline and trace files dag.pdf timeline.html trace.txt results/ # final metrics.json and intermediate outputs <sample_id>.metrics.json bcftools/ count_variants/ verifybamid2/ samtools/ plot_bamstats/ picard/ multiqc/
If keep_workdir
has been specified, the contents of the Nextflow work directory (work-dir
) will also be preserved.
We provide a schematic representation of the workflow in the figure below:
In a nutshell, NPM-sample-qc generates QC metrics from single-sample WGS results in three stages: metrics calculation, parsing of intermediate outputs and generation of a final report. This makes it possible to take full advantage of the parallelisation capabilities of Nextflow, allows users to leverage third-party tools or add custom scripts, and enables auto-documentation of metrics from code comments.
Metrics calculation
The current workflow combines widely-used third-party tools (samtools, bcftools, picard, VerifyBamID2) and custom scripts (e.g. count_variants.py) to obtain a rich set of QC metrics. Full details on which processes are run/when can be found in the actual workflow definition (main.nf
). We also provide an example dag for a more visual representation (tests/example_dag.pdf
).
Metrics parsing
Next, output files from each individual tool are parsed and combined into a single json file. This is done by calling MultiQC_NPM, a MultiQC plugin that extends the base tool to support additional files (e.g. outputs from bcftools gtcheck, picard CollectQualityYieldMetrics and count_variants.py).
Metrics reporting
Finally, the contents of the MultiQC json are formatted into a final metrics report, also in json format. The reporting logic lives in the compile_metrics.py script, and whilst its contents are simple, it enables automatic documentation of metric definitions from code comments (see the Metric definitions section).
The full list of metrics reported by the NPM-sample-qc workflow and details on how they've been calculated can be found here.
When needed, page contents can be updated by running the following command:
cd docsrc; ./build.sh