ncbench-workflow

Primary language: Python. License: MIT.

NCBench continuous small variants benchmarking workflow.


A Snakemake workflow for benchmarking callsets of small genomic variants against popular benchmark datasets such as Genome in a Bottle or CHM-eval. A detailed description of the workflow, including the underlying insights and design decisions, can be found at https://doi.org/10.12688/f1000research.140344.1.

Contributing callsets

  1. Download raw data:
  • Germline:

    dataset link
    NA12878 Agilent (75M and 200M reads): DOI
    NA12878 Twist (restricted access; request it via the Zenodo interface): DOI
    CHM: URL
  • Somatic:

    dataset     SRA ID tumor  fastq link tumor bam  SRA ID normal  fastq link normal bam
    SEQC2 WES   SRR7890918    Website               SRR7890919     Website
    SEQC2 WGS   SRR7890893    Website               SRR7890943     Website
    SEQC2 FFPE  SRR7890933    Website               SRR7890951     Website
  2. Run your pipeline on it.
  3. Upload results (VCF or BCF) to Zenodo.
  4. Create a pull request that adds your results to the config file, under variant-calls. The entry must follow this structure:
    my-callset: # choose a descriptive name for your callset
      labels:
        site: # name of your institute, group, department etc.
        pipeline: # name of the pipeline
        trimming: # tool used to trim reads
        read-mapping: # read mapper used
        base-quality-recalibration: # base recalibration method (remove if unused)
        realignment: # realignment method (remove if unused)
        variant-detection: # variant callers (provide a comma-separated list if several are used)
        genotyping: # genotyper/event-typer used
        url: # URL of the pipeline used
        # add any additional relevant attributes (they will appear in the false positive and false negative tables of the online report)
      subcategory: # category of callsets to include this one in (see other entries in the config file and align with them if possible)
      zenodo:
        deposition: # Zenodo record ID (e.g. 7734975)
        filename: # name of vcf/bcf/vcf.gz file in the Zenodo record
      benchmark: # benchmark to use (one of giab-NA12878-agilent-200M, giab-NA12878-agilent-75M, giab-NA12878-twist, and more, see https://github.com/snakemake-workflows/dna-seq-benchmark/blob/main/workflow/resources/presets.yaml)
      rename-contigs: resources/rename-contigs/ucsc-to-ensembl.txt # rename contigs from UCSC (prefixed with chr) to Ensembl style (remove if your contigs are already in Ensembl style)
  5. The pull request is executed automatically with the ncbench workflow, and you can download the resulting report with the assessment of your callset as an artifact from the GitHub Actions CI interface.
  6. Once the pull request has been reviewed and merged, your results will appear in the online report at https://ncbench.github.io.
  7. If your callset receives an update, update your Zenodo record and create a new pull request that updates the Zenodo record ID in your config entry.
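
Before opening a pull request, it can help to check that a callset entry has the shape described in the steps above. The following Python sketch is not part of the workflow. The function name check_callset_entry, the choice of which keys count as required, and the example values (site, pipeline, tools, subcategory, file name) are assumptions made purely for illustration. Optional labels such as base-quality-recalibration and realignment are deliberately not checked.

```python
# Illustrative sketch only: which keys are treated as required is an
# assumption based on the template in this README, not the workflow's
# own validation logic.
REQUIRED_TOP_LEVEL = ("labels", "subcategory", "zenodo", "benchmark")
REQUIRED_LABELS = (
    "site",
    "pipeline",
    "trimming",
    "read-mapping",
    "variant-detection",
    "genotyping",
    "url",
)
REQUIRED_ZENODO = ("deposition", "filename")
ALLOWED_SUFFIXES = (".vcf", ".vcf.gz", ".bcf")


def check_callset_entry(entry):
    """Return a list of human-readable problems found in one callset entry."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if not entry.get(key):
            problems.append(f"missing or empty key: {key}")
    labels = entry.get("labels") or {}
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"missing or empty label: {key}")
    zenodo = entry.get("zenodo") or {}
    for key in REQUIRED_ZENODO:
        if not zenodo.get(key):
            problems.append(f"missing or empty zenodo field: {key}")
    filename = str(zenodo.get("filename") or "")
    if filename and not filename.endswith(ALLOWED_SUFFIXES):
        problems.append(f"unexpected file type: {filename}")
    return problems


# Hypothetical entry: every value except the benchmark name and the
# example record ID is made up for illustration.
example_entry = {
    "labels": {
        "site": "Example Institute",
        "pipeline": "example-pipeline",
        "trimming": "example-trimmer",
        "read-mapping": "example-mapper",
        "variant-detection": "example-caller",
        "genotyping": "example-caller",
        "url": "https://example.org/pipeline",
    },
    "subcategory": "example-subcategory",
    "zenodo": {"deposition": 7734975, "filename": "my-callset.vcf.gz"},
    "benchmark": "giab-NA12878-agilent-200M",
}

for problem in check_callset_entry(example_entry):
    print(problem)
```

The check works on a plain dictionary, so it can be applied to a single entry of the variant-calls section after loading the config file with any YAML parser.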

Checking out results

The latest results for all contributed callsets are shown at https://ncbench.github.io.

Running ncbench locally

To run ncbench locally, follow these steps:

  1. Install Mamba and Snakemake.
  2. Clone this git repository.
  3. Adapt the configuration to your needs (e.g. add your own callset, and maybe remove all the other callsets if you are only interested in your own). When adding your own callset, you can either refer to a Zenodo record or (probably more useful in the local case) refer to a local path. The following is a minimal entry for evaluating a local callset, to be added to the variant-calls section in the file config/config.yaml of your local clone:
    my-callset: # choose a descriptive name for your callset
      path: # path to vcf/bcf/vcf.gz file containing your variant calls (both SNVs and indels, sorted by coordinate)
      benchmark: # benchmark to use (one of giab-NA12878-agilent-200M, giab-NA12878-agilent-75M, giab-NA12878-twist, and more, see https://github.com/snakemake-workflows/dna-seq-benchmark/blob/main/workflow/resources/presets.yaml)
      rename-contigs: resources/rename-contigs/ucsc-to-ensembl.txt # rename contigs from UCSC (prefixed with chr) to Ensembl style (remove if your contigs are already in Ensembl style)
  4. Run the workflow, first in dry-run mode with snakemake -n --sdm conda and then for real with snakemake --sdm conda --cores N, where N is your desired number of cores. You can also run it on cluster or cloud middleware; the Snakemake documentation provides the details.
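
The two-stage invocation from the last step (dry run first, then the real run) could be scripted along the following lines. This is only a sketch: the helper name build_snakemake_commands is hypothetical, and the block prints the commands rather than executing them. The flags are the ones quoted in this README.

```python
import shlex


def build_snakemake_commands(cores):
    """Return the dry-run and real-run command lines as argument lists."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    # Dry run: shows what would be executed without running any job.
    dry_run = ["snakemake", "-n", "--sdm", "conda"]
    # Real run: same deployment method, with the requested number of cores.
    real_run = ["snakemake", "--sdm", "conda", "--cores", str(cores)]
    return [dry_run, real_run]


# Print the commands in shell-quoted form for inspection.
for command in build_snakemake_commands(cores=4):
    print(shlex.join(command))
```

To actually execute the commands from the root of the local clone, each argument list could be passed to subprocess.run(command, check=True), so that a failing dry run stops the script before the real run starts.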