
A Django Database for holding genetic variants

VariantDatabase allows the following:

  1. Basic sample tracking capabilities. Organise Projects, Runs and Samples.
  2. Store variants that have been discovered.
  3. Parse and store run QC data from Illumina InterOp files.
  4. Parse and store sample QC data from SamStats.
  5. Allow searching for previously seen variants.
  6. View variants that have been found within a specific sample.
  7. Visualise variant annotation data.
  8. Integrate IGV.js to allow VCF and BAM viewing.
  9. Store the evidence and comments that Clinical Scientists make when analysing variants.

Getting Started




Python 2.7.11



Python Packages








To serve VCF and BAM files through IGV.js, a web server that supports HTTP range requests is required. A typical deployment uses Nginx for this, paired with Gunicorn to handle dynamic requests.
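As a rough way to confirm that a server honours range requests, the sketch below asks for only the first byte of a file and inspects the reply. The function names are illustrative and not part of VariantDatabase; the check assumes that a `206 Partial Content` status with a `Content-Range` header is sufficient evidence of range support.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2.7


def supports_byte_ranges(status, headers):
    """Interpret the response to a 'Range: bytes=0-0' request."""
    # Header names are compared case-insensitively.
    lowered = {key.lower(): value for key, value in headers.items()}
    # 206 plus a Content-Range header means the requested range was honoured.
    return status == 206 and lowered.get("content-range", "").startswith("bytes ")


def probe(url):
    """Request the first byte of url and report whether the range was honoured."""
    request = Request(url, headers={"Range": "bytes=0-0"})
    response = urlopen(request)
    try:
        return supports_byte_ranges(response.getcode(),
                                    dict(response.info().items()))
    finally:
        response.close()
```

Calling `probe` with the URL of a BAM or VCF file served by your deployment should return True; a server that ignores the header normally answers with status 200 and the whole file, which this check reports as False.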

To annotate vcfs VEP is required (Tested on API and Cache Version 90):



Step 1 - Install Requirements

Within your Python virtualenv, type:

git clone https://github.com/WMRGL/VariantDatabase.git

cd VariantDatabase

pip install -r requirements.txt

Step 2 - Database Setup

python manage.py migrate

python manage.py makemigrations VariantDatabase

python manage.py migrate

python manage.py createsuperuser

Follow the prompts to create a superuser.

Step 3 - Load initial data

python manage.py loaddata db_setup.json

Step 4 - Test

python manage.py test

Step 5 - Run

python manage.py runserver

Open the address printed by runserver in your web browser to see the welcome page.

Uploading Data


The main utility for uploading data into the database is the master_upload management command.

For help using this program type:

python manage.py master_upload -h

For example, to upload all data for a worksheet (SampleSheet, Variants, Run QC, Sample QC, Gene Coverage and Exon Coverage), enter the following:

python manage.py master_upload --worksheet_dir /home/cuser/Documents/Project/DatabaseData/worksheet_dir/ --output_dir /home/cuser/Documents/Project/DatabaseData/MPN_213837/ --sample_sheet --run_qc --sample_qc --coverage --variants

It is important that the directories specified by the -w/--worksheet_dir and -o/--output_dir options are structured correctly.


-w/--worksheet_dir: The path to the worksheet directory.

This is the Illumina directory containing the file SampleSheet.csv. It should be structured as shown below. Only the files needed for the VariantDatabase to function correctly are shown. Folder and file names are case sensitive.

│   SampleSheet.csv
│   RunParameters.xml
│   RunInfo.xml
│   RunCompletionStatus.xml
│   CompletedJobInfo.xml
│   GenerateFASTQRunStatistics.xml
│
├───InterOp
│   │   ControlMetricsOut.bin
│   │   CorrectedIntMetricsOut.bin
│   │   ExtractionMetricsOut.bin
│   │   IndexMetricsOut.bin
│   │   QMetricsOut.bin
│   │   TileMetricsOut.bin
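Before running an upload, it can help to confirm that the worksheet directory contains every file listed above. The following is a minimal sketch, not part of VariantDatabase: the function name is hypothetical, and it assumes the .bin files sit in an InterOp subfolder, as they do in a standard Illumina run folder.

```python
import os

# File names copied from the listing above.
RUN_FILES = [
    "SampleSheet.csv",
    "RunParameters.xml",
    "RunInfo.xml",
    "RunCompletionStatus.xml",
    "CompletedJobInfo.xml",
    "GenerateFASTQRunStatistics.xml",
]
INTEROP_FILES = [
    "ControlMetricsOut.bin",
    "CorrectedIntMetricsOut.bin",
    "ExtractionMetricsOut.bin",
    "IndexMetricsOut.bin",
    "QMetricsOut.bin",
    "TileMetricsOut.bin",
]


def missing_worksheet_files(worksheet_dir, interop_subdir="InterOp"):
    """Return the relative paths of expected files absent from worksheet_dir."""
    expected = list(RUN_FILES)
    # Assumption: the InterOp metrics live in a subfolder of the run folder.
    expected += [os.path.join(interop_subdir, name) for name in INTEROP_FILES]
    return [path for path in expected
            if not os.path.isfile(os.path.join(worksheet_dir, path))]
```

An empty list means every expected file was found. On a case-sensitive file system the comparison is case sensitive, matching the note above.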


-o/--output_dir: The path to the pipeline output directory.

The directory should be structured as shown below. Only the files needed for the VariantDatabase to function correctly are shown. Folder and file names are case sensitive.

* = wildcard

sample_name = The unique sample name specified in the SampleSheet.csv file

│   │   sample_name.bwa.drm.realn.sorted.bam
│   │   sample_name.bwa.drm.realn.sorted.bam.bai
│   │   ...
│   │
│   └───*QC_stats.zip
│       │
│       └───*QC_stats
│           │   sample_name.bwa.drm.realn.sorted.bam.stats
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-acgt-cycles.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-coverage.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-gc-content.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-gc-depth.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-indel-cycles.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-indel-dist.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-insert-size.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-quals.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-quals2.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-quals3.png
│           │   sample_name.bwa.drm.realn.sorted.bam.stats-quals-hm.png
│           │   ...
│   │   sample_name.exon-count-data.tsv.gz
│   │   sample_name.gene-count-data.tsv.gz
│   │   ...
│   │   sample_name*.vcf.gz
│   │   sample_name*.vcf.gz.tbi
│   │   ...
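A similar pre-upload check can be sketched for the per-sample files in the output directory. Because the folder names are not fixed here, this example searches the whole tree for each file name pattern. The function names are hypothetical, and the QC stats files are left out because they sit inside a zip archive.

```python
import fnmatch
import os

# Patterns taken from the listing above; {sample} is the SampleSheet sample name.
SAMPLE_PATTERNS = [
    "{sample}.bwa.drm.realn.sorted.bam",
    "{sample}.bwa.drm.realn.sorted.bam.bai",
    "{sample}.exon-count-data.tsv.gz",
    "{sample}.gene-count-data.tsv.gz",
    "{sample}*.vcf.gz",
    "{sample}*.vcf.gz.tbi",
]


def find_sample_files(output_dir, sample_name):
    """Map each expected file pattern to the matching paths under output_dir."""
    patterns = [p.format(sample=sample_name) for p in SAMPLE_PATTERNS]
    found = {pattern: [] for pattern in patterns}
    for folder, _subdirs, files in os.walk(output_dir):
        for name in files:
            for pattern in patterns:
                # fnmatchcase keeps the match case sensitive on every platform.
                if fnmatch.fnmatchcase(name, pattern):
                    found[pattern].append(os.path.join(folder, name))
    return found


def missing_patterns(output_dir, sample_name):
    """Return the patterns for which no file was found."""
    found = find_sample_files(output_dir, sample_name)
    return sorted(pattern for pattern, paths in found.items() if not paths)
```

Note that the wildcard patterns are loose: a sample name ending in `_S2` would also match the VCF of a sample ending in `_S20`, so treat the result as a quick sanity check rather than a strict validation.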


--sample_sheet: Add this option to import the information within the SampleSheet.csv file into the database. This will import a new worksheet and create sample objects as specified in the SampleSheet.


--run_qc: Add this option to import the run QC information into the database. This is the information contained within the InterOp files.


--sample_qc: Add this option to import the sample QC information into the database. This is the information created by the SamStats program.


--coverage: Add this option to import the Gene and Exon coverage information into the database.


--variants: Add this option to import the variant information contained within the VEP-annotated VCF files.


--single_sample: Use this option to upload the data for a single sample. Example below:

python manage.py master_upload --worksheet_dir /home/cuser/Documents/Project/DatabaseData/worksheet_dir/ --output_dir /home/cuser/Documents/Project/DatabaseData/MPN_213837/ --sample_qc --coverage --variants --single_sample 213837-2-D17-26177-HP_S2

Note that the --sample_sheet and --run_qc options are not available when using the --single_sample option.
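If uploads are scripted, the command line can be assembled in Python so that the restriction on --single_sample is checked before anything runs. This is a sketch only: the helper names are hypothetical, and the flags are limited to those shown in the commands above.

```python
import subprocess

# Data flags that appear in the master_upload examples above.
DATA_FLAGS = ("sample_sheet", "run_qc", "sample_qc", "coverage", "variants")
# Flags that are not available together with --single_sample.
RUN_LEVEL_FLAGS = ("sample_sheet", "run_qc")


def build_upload_command(worksheet_dir, output_dir, steps, single_sample=None):
    """Return the master_upload command as an argument list."""
    unknown = [step for step in steps if step not in DATA_FLAGS]
    if unknown:
        raise ValueError("Unknown steps: {0}".format(", ".join(unknown)))
    if single_sample is not None:
        clash = [step for step in steps if step in RUN_LEVEL_FLAGS]
        if clash:
            raise ValueError(
                "Not available with --single_sample: {0}".format(", ".join(clash)))
    command = ["python", "manage.py", "master_upload",
               "--worksheet_dir", worksheet_dir,
               "--output_dir", output_dir]
    command += ["--" + step for step in steps]
    if single_sample is not None:
        command += ["--single_sample", single_sample]
    return command


def run_upload(*args, **kwargs):
    """Run the upload from the project root; raises CalledProcessError on failure."""
    subprocess.check_call(build_upload_command(*args, **kwargs))
```

Building the argument list separately from running it makes the validation easy to test without touching the database.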

VCF Format

For the VCF files to be parsed correctly by the VariantDatabase parser (parsers/vcf_parser.py), they must be annotated by VEP.

Once VEP is installed, annotate your VCFs with the following command:

vep -i input_vcf -o output.vcf --cache --fork 4 --refseq --vcf --flag_pick --exclude_predicted --everything --dont_skip --total_length --offline --fasta fasta_location

Other required VCF annotations include INFO/Caller, FORMAT/AD, INFO/TCF, INFO/TCR and INFO/VAFS.
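A quick way to catch a VCF that lacks these annotations is to look for their declarations in the header. The sketch below is written for Python 3 and is not the project's parser. The function names are hypothetical, and it only checks that each field is declared in a `##INFO` or `##FORMAT` header line, not that every record carries a value.

```python
import gzip
import re

# The (section, ID) pairs named in the text above.
REQUIRED_FIELDS = [
    ("INFO", "Caller"),
    ("FORMAT", "AD"),
    ("INFO", "TCF"),
    ("INFO", "TCR"),
    ("INFO", "VAFS"),
]

# Matches header lines such as: ##INFO=<ID=TCF,Number=1,...>
HEADER_FIELD = re.compile(r"^##(INFO|FORMAT)=<ID=([^,>]+)")


def missing_annotations(header_lines):
    """Return the required (section, ID) pairs not declared in the header."""
    declared = set()
    for line in header_lines:
        match = HEADER_FIELD.match(line)
        if match:
            declared.add((match.group(1), match.group(2)))
    return [field for field in REQUIRED_FIELDS if field not in declared]


def read_vcf_header(path):
    """Return the '##' meta lines of a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    lines = []
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break
            lines.append(line.rstrip("\n"))
    return lines
```

Running `missing_annotations(read_vcf_header(path))` on an annotated file should return an empty list when all five fields are declared.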

The annotated VCFs can then be compressed with bgzip and indexed with tabix in preparation for database import:

bgzip file_name

tabix file_name.gz
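tabix needs BGZF (bgzip) compression and will not index a file compressed with plain gzip, even though both use the .gz extension. The sketch below, with hypothetical function names, tells the two apart by inspecting the first 16 bytes. It assumes the BGZF "BC" extra subfield comes first in the gzip header, which is how bgzip writes it.

```python
def looks_like_bgzf(first_bytes):
    """Return True if the given leading bytes look like a BGZF block header."""
    return (
        len(first_bytes) >= 16
        # gzip magic number, deflate method, and the FEXTRA flag set.
        and first_bytes[0:4] == b"\x1f\x8b\x08\x04"
        # 'BC' extra subfield with a 2-byte payload (the block size).
        and first_bytes[12:16] == b"BC\x02\x00"
    )


def file_is_bgzf(path):
    """Check the first 16 bytes of the file at path."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(16))
```

A file that fails this check can be decompressed and recompressed with bgzip before running tabix.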

User Guide

  • Joseph Halstead