The GIAB benchmark set development framework (DeFrABB) is a snakemake bioinformatics pipeline for developing transparent and reproducible genome-assembly-based small and structural variant benchmark sets. This pipeline is in active development. Releases are used to capture the version of the code base used to generate shared versions of our draft benchmark sets.
- Detailed diagram by Jenny: https://lucid.app/lucidchart/a1feb68c-838b-4851-8259-8289d8cd5c53/edit?invitationId=inv_977874a3-d753-4518-869d-3f0a8ca5eb2c&page=0_0#
- High-level diagram by Nate: https://lucid.app/lucidchart/aea8aae0-c550-420d-80df-95b73c0cc840/edit?page=0_0#
The following usage documentation is for running the pipeline locally using mamba on a Linux system (a requirement for hap.py). Please use these instructions as a starting point for running the pipeline locally. Contact Nathan Olson at nolson@nist.gov with questions, or submit an issue, if you are unable to run the pipeline locally or have other questions about the pipeline. This pipeline was developed and is maintained primarily for use in generating benchmark sets for the GIAB RMs by the NIST-GIAB team. The code is provided for transparency in the benchmark set development process.
- Clone the repository

  ```sh
  git clone https://github.com/usnistgov/giab-defrabb.git
  ```

- Generate the conda snakemake environment

  ```sh
  mamba create -n snakemake --file envs/env.yml
  ```

- Activate the environment

  ```sh
  mamba activate snakemake
  ```

- Run the built-in test analyses

  ```sh
  snakemake --use-conda -j1
  ```

- Define analysis runs by editing `config/analyses.tsv`
- Update `config/resources.yml` as necessary
- Run the pipeline using

  ```sh
  snakemake --use-conda -j1
  ```
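If you want to point a run at a different analyses table without editing files, the `--config analyses=` override used in the testing section below should also work here; the table path in this sketch is a placeholder:

```sh
# Hypothetical table path; uses the --config analyses= override shown
# later in this README for the larger test run set
snakemake --use-conda -j1 --config analyses=config/my_analyses.tsv
```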
```
usage: run_defrabb [-h] -r RUNID [-a ANALYSES] [-o OUTDIR] [-j JOBS] [--archive_dir ARCHIVE_DIR]
                   [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [-s]

DeFrABB wrapper script for executing and archiving framework

options:
  -h, --help            show this help message and exit
  -r RUNID, --runid RUNID
                        Analysis RUN ID, please use following naming convention YYYYMMDD_milestone_brief-id
  -a ANALYSES, --analyses ANALYSES
                        defrabb run analyses table
  -o OUTDIR, --outdir OUTDIR
                        Output directory
  -j JOBS, --jobs JOBS  Number of jobs used by snakemake
  --archive_dir ARCHIVE_DIR
                        Directory to copy pipeline run output to for release. Primarily intended for internal NIST use.
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

workflow steps:
  -s , --steps          Defining which workflow steps are run:
                        all: pipe, report, and archive (default)
                        pipe: just the snakemake pipeline
                        report: generating the snakemake run report
                        archive: generating snakemake archive tarball
                        release: copy run output to NAS for upload to Google Drive (internal NIST use-case)
```
Any additional arguments provided will be passed directly to Snakemake.
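As a sketch of this pass-through behavior, extra Snakemake flags can be appended to a `run_defrabb` call; the run ID below is hypothetical and `--dry-run` is a standard Snakemake flag:

```sh
# Hypothetical run ID; --dry-run is forwarded to snakemake to preview
# the jobs that would be executed without running them
./run_defrabb -r 20240101_v0.015_example -j 4 --dry-run
```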
Create the specified directory and move to the new directory, e.g. using the commands below. The `run_id` should follow the naming convention YYYYMMDD_milestone_brief-id, where milestone is the defrabb version tag and brief-id is a brief description (a few words separated by '-') of the analysis.
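For example, the run ID used in the testing section at the end of this README follows this convention:

```sh
# 2024-05-13 run, milestone v0.016, brief-id "test"
run_id=20240513_v0.016_test
```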
```sh
## Directory for run
DIR=/defrabb_runs/runs_in_progress/{run_id}
mkdir $DIR
cd $DIR

## Clone repo
git clone git@gitlab.nist.gov:bbd-human-genomics/defrabb.git .
```
Update the config and run using

```sh
./run_defrabb \
    -r {run_id} \
    -o ../ -s pipe
```

You will not need to provide `-a` if using an analysis table that follows the run ID naming convention. Run the pipeline with `-s all`, or run the rest of the steps individually, to generate the report and snakemake archive and to share (release) the run.
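For example, after a `-s pipe` run completes, the remaining steps can be run one at a time; the step names come from the help text above and `{run_id}` is a placeholder:

```sh
./run_defrabb -r {run_id} -o ../ -s report    # snakemake run report
./run_defrabb -r {run_id} -o ../ -s archive   # snakemake archive tarball
./run_defrabb -r {run_id} -o ../ -s release   # internal NIST use only
```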
- (For NIST internal run documentation) Fill out the README with relevant run information: framework repo info, [milestone] tag (with some potential, hopefully minor, differences), who ran the framework and where/how, justification/reasoning for the analyses, and JZ notes (what did we learn); use the defrabb run README template.
- (For NIST internal run documentation) Add run information to the defrabb run log spreadsheet
The default output and release directories in the `run_defrabb` script are configured specifically for internal use. The output directory can be provided as a command line argument, while the release directory is hard-coded to copy files to the NIST-GIAB team NAS. You will need to modify the output directory path to one appropriate for your setup.
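A minimal sketch of an external run with a custom output directory (the path is a placeholder for your setup):

```sh
# Replace /path/to/your/runs with a directory appropriate for your system
./run_defrabb -r {run_id} -o /path/to/your/runs -s pipe
```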
Output directory structure:

```
.
├── [RUN ID].archive.tar.gz - pipeline archive with code, dependencies, and input generated by snakemake
├── [RUN ID].report.zip - snakemake run report, includes an interactive html page with run information and some results
├── analyses_[RUN ID].tsv - config table defining the snakemake pipeline run
├── resources.yml - config file with urls for external inputs and rule parameters
├── benchmark - rule run time along with cpu and memory usage information
├── logs - log files for individual rules
├── defrabb_environment.yml - conda env file with the high-level environment used to run the pipeline
├── resources - externally sourced input files
│   ├── assemblies - assemblies used for benchmark generation
│   ├── comparison_variant_callsets - comparison callsets and regions used in benchmark evaluations
│   ├── exclusions - bed files used to define benchmark exclusion regions
│   ├── references - reference genomes the assemblies are compared to
│   └── strats - GIAB stratifications used to stratify evaluation results
├── results
│   ├── asm_varcalls - reference v. assembly comparisons including vcf annotations
│   ├── draft_benchmarksets - benchmark regions and variants along with intermediate files
│   ├── evaluations - benchmarking of draft benchmark sets against comparison callsets
│   └── report - summary stats and metrics used in the snakemake report
└── run.log - run_defrabb.sh log file
```
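To take a quick look at a run's archive and report without fully extracting them, standard tar and unzip listings work (`[RUN ID]` is a placeholder):

```sh
tar -tzf "[RUN ID].archive.tar.gz" | head   # list the first few archived files
unzip -l "[RUN ID].report.zip"              # list the report contents
```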
A small dataset with chr21 for GRCh38[^1] was developed for testing and development.
The resources are either included in the repository or hosted in a GIAB S3 bucket, with the appropriate urls included in `resources.yml`.
Small example datasets are included or made available for testing the framework code.
Test the pipeline for runtime errors using `snakemake --use-conda -j1`.
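A dry run first can catch configuration errors without executing any jobs; `-n` is standard Snakemake behavior, not defrabb-specific:

```sh
# Preview the jobs that would run without executing them
snakemake --use-conda -j1 -n
```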
Unit tests for individual python functions and snakemake rules are implemented with python unittest and pytest, respectively.
The pytest unit tests were initially generated using snakemake's `--generate-unit-tests` functionality.
The test scripts were modified as needed: removing unnecessary tests, including the config directory, modifying commands for appropriate inputs, and limiting the number of test data files for smaller tests.
Additional modifications were made for bam and vcf comparisons, specifically ignoring file headers, as the metadata for the test and expected files are not consistent.
The functions to be tested need to be in a .py file.
- Copy `rules/common.smk` to `test/common.py` for running tests.
- Run tests using `python -m unittest test/common.py test/unit/config.py`
- Tests are run using `pytest .tests`
- Tests assume `GRCh38_chr21.fa` and `GRCh38_chr21.fa.fai` are in `.tests/integration/resources/references`. These files are not included in the repository for now, to avoid including large data files in the repo, so they might need to be downloaded before running tests.
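One way to produce these reference files, assuming you have a full GRCh38 fasta locally and samtools installed (the source fasta name and paths are assumptions; adjust for your setup):

```sh
mkdir -p .tests/integration/resources/references
# Per the footnote below, chr13 was also included in the test reference
# to work around a dipcall line-break issue when only chr21 is present
samtools faidx GRCh38.fa chr21 chr13 \
  > .tests/integration/resources/references/GRCh38_chr21.fa
samtools faidx .tests/integration/resources/references/GRCh38_chr21.fa
```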
```sh
## pytest
pytest .tests

## Snakemake pipeline
snakemake --use-conda -j 1 --forceall

## Larger snakemake analysis run set
snakemake --use-conda -j 1 --forceall --config analyses=config/analyses_fulltest.tsv

## Test using run_defrabb script
./run_defrabb -r 20240513_v0.016_test -a config/analyses.tsv -s pipe
```
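To iterate on a single unit test, pytest's standard `-k` selector can narrow the run; the test name below is a placeholder:

```sh
# Run only tests whose names match the (hypothetical) pattern, verbosely
pytest .tests -k "test_some_rule" -v
```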
[^1]: Chromosome 13 was included in the test dataset reference because dipcall incorrectly included line breaks in dip.vcf when only chr21 was included. We might want to submit an issue to the dipcall repo about this.