sv-benchmark

Public Benchmark of Long-Read Structural Variant Callers on ONT PromethION Data

Structural Variant Calling Benchmark

This benchmark is based on the publicly available long-read sequencing data (Oxford Nanopore PromethION reads) of the Ashkenazim son HG002/NA24385. I provide each step needed to reproduce the final metrics with publicly available tools.

Thanks to armintoepfer for his great work!

Comparison

Coverage titration at 5-, 10-, 20-, and 47-fold:

Fig 1. Recall & Precision

Fig 2. GT-Recall & GT-Precision

At 5-fold coverage:

| Method | F1 (%) | Precision (%) | Recall (%) | GT-F1 (%) | GT-Precision (%) | GT-Recall (%) |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| pbsv | 63.20 | 75.45 | 54.37 | 55.33 | 62.46 | 49.66 |
| svim | 75.25 | 78.43 | 72.33 | 50.40 | 43.80 | 59.34 |
| sniffles | 73.89 | 83.74 | 66.11 | 48.82 | 46.15 | 51.81 |
| cuteSV | 81.74 | 90.63 | 74.44 | 73.66 | 76.45 | 71.07 |

At 10-fold coverage:

| Method | F1 (%) | Precision (%) | Recall (%) | GT-F1 (%) | GT-Precision (%) | GT-Recall (%) |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| pbsv | 79.99 | 81.86 | 78.20 | 73.23 | 70.96 | 75.66 |
| svim | 81.55 | 81.11 | 81.99 | 74.60 | 70.09 | 79.74 |
| sniffles | 80.80 | 87.66 | 74.94 | 56.90 | 51.42 | 63.69 |
| cuteSV | 88.85 | 93.07 | 85.00 | 83.83 | 84.00 | 83.65 |

At 20-fold coverage:

| Method | F1 (%) | Precision (%) | Recall (%) | GT-F1 (%) | GT-Precision (%) | GT-Recall (%) |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| pbsv | 85.98 | 88.60 | 83.52 | 79.65 | 77.75 | 81.64 |
| svim | 83.35 | 76.60 | 91.39 | 77.26 | 67.49 | 90.34 |
| sniffles | 84.78 | 84.78 | 84.78 | 62.37 | 52.21 | 77.43 |
| cuteSV | 93.34 | 92.22 | 94.49 | 90.13 | 86.44 | 94.15 |

At 47-fold coverage (pbsv crashed!):

| Method | F1 (%) | Precision (%) | Recall (%) | GT-F1 (%) | GT-Precision (%) | GT-Recall (%) |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| pbsv | -- | -- | -- | -- | -- | -- |
| svim | 88.74 | 85.95 | 91.72 | 82.82 | 76.16 | 90.76 |
| sniffles | 86.98 | 84.63 | 89.46 | 64.78 | 52.69 | 84.09 |
| cuteSV | 94.33 | 92.15 | 96.61 | 91.98 | 87.91 | 96.45 |
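
As a reading aid: F1 is the harmonic mean of precision (P) and recall (R), i.e. F1 = 2 * P * R / (P + R). For example, cuteSV at 5-fold coverage has P = 90.63 and R = 74.44, giving 2 * 90.63 * 74.44 / (90.63 + 74.44) ≈ 81.74, matching the table. A one-liner to recompute it for any row (purely illustrative):

# Recompute F1 from precision and recall, here for cuteSV at 5-fold coverage.
awk -v p=90.63 -v r=74.44 'BEGIN { printf "F1 = %.2f\n", 2*p*r/(p+r) }'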

Get tools

Information on how to install conda and add the bioconda channel is available at https://bioconda.github.io/.

conda create --name sv python=3
source activate sv
conda install minimap2==2.17 pbsv==2.2.0 svim==1.2.0 sniffles==1.0.11 cuteSV==1.0.3 truvari==1.2 samtools bgzip tabix
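
To confirm the pinned versions resolved as expected, a quick (optional) check of the environment:

# List the installed versions of the pinned tools in the active environment.
conda list | grep -Ei 'minimap2|pbsv|svim|sniffles|cutesv|truvari|samtools|tabix'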

Get data

  1. Create directory structure:
mkdir -p fastqs ref alns tools/{pbsv,sniffles,svim,cuteSV} giab
  2. Download Genome in a Bottle (GIAB) annotations:
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis
curl -s ${FTPDIR}/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.bed > giab/HG002_SVs_Tier1_v0.6.bed
curl -s ${FTPDIR}/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz > giab/HG002_SVs_Tier1_v0.6.vcf.gz
tabix -p vcf giab/HG002_SVs_Tier1_v0.6.vcf.gz
  3. Download the hg19 reference with decoys and map non-ACGT characters to N:
curl -s ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz > ref/human_hs37d5.fasta.gz
gunzip ref/human_hs37d5.fasta.gz
sed -i '/^[^>]/ y/BDEFHIJKLMNOPQRSUVWXYZbdefhijklmnopqrsuvwxyz/NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN/' ref/human_hs37d5.fasta
  4. Download hg19 tandem repeat annotations:
curl -s https://raw.githubusercontent.com/PacificBiosciences/pbsv/master/annotations/human_hs37d5.trf.bed > ref/human_hs37d5.trf.bed
  5. Download all .fastq files:
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/
for fastq in $(curl -s -l ${FTPDIR} | grep '.fastq'); do curl -s ${FTPDIR}${fastq} > fastqs/${fastq}; done
for fastq in `ls fastqs`; do gunzip fastqs/${fastq}; done

Alignment

  6. Align each FASTQ:
for i in `ls fastqs/`;do
    minimap2 ref/human_hs37d5.fasta fastqs/$i -a -z 600,200 -x map-ont \
        --MD -Y -o alns/$i.sam -R '@RG\tID:hg2'
done
  7. Sort and merge the alignments (a rough coverage check of the merged BAM follows below):
for i in `ls alns/*.sam`;do
    samtools view -buS $i | samtools sort -O bam -T ./ - > ${i%.sam}.bam
done
for i in `ls alns/*.bam`; do echo $i; done > input_bam.fofn
samtools merge alns/GM24385_all.bam -b input_bam.fofn && samtools index alns/GM24385_all.bam
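
The subsample fractions used in the coverage titration at the end assume the merged BAM is roughly 47-fold. A rough way to verify the mean coverage (this scans the whole genome, so it takes a while, and the exact value depends on how secondary/supplementary alignments are counted):

# Rough genome-wide mean coverage of the merged BAM (includes zero-depth positions).
samtools depth -a alns/GM24385_all.bam | awk '{ sum += $3 } END { print "mean coverage:", sum/NR }'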

Run pbsv

8a) Discover SV signatures from the merged alignment (with multiple BAMs, this step can be run in parallel per BAM):

pbsv discover --tandem-repeats ref/human_hs37d5.trf.bed alns/GM24385_all.bam tools/pbsv/hg2_ont.svsig.gz

8b) Call and polish SVs:

pbsv call ref/human_hs37d5.fasta tools/pbsv/hg2_ont.svsig.gz tools/pbsv/hg2_ont.pbsv.vcf -t INS,DEL
bgzip tools/pbsv/hg2_ont.pbsv.vcf
tabix tools/pbsv/hg2_ont.pbsv.vcf.gz

Run svim

9a) Run svim:

svim alignment tools/svim alns/GM24385_all.bam ref/human_hs37d5.fasta --min_sv_size 30

9b) Prepare for truvari (keep only insertion and deletion calls, with duplications relabeled as insertions, drop calls with fewer than 10 supporting reads, and relabel the q5 filter as PASS):

cat tools/svim/final_results.vcf \
    | sed 's/INS:NOVEL/INS/g' \
    | sed 's/DUP:INT/INS/g' \
    | sed 's/DUP:TANDEM/INS/g' \
    | awk '{ if($1 ~ /^#/) { print $0 } else { if($5=="<DEL>" || $5=="<INS>") { print $0 }}}' \
    | grep -v 'SUPPORT=1;\|SUPPORT=2;\|SUPPORT=3;\|SUPPORT=4;\|SUPPORT=5;\|SUPPORT=6;\|SUPPORT=7;\|SUPPORT=8;\|SUPPORT=9;' \
    | sed 's/q5/PASS/g' > tools/svim/hg2_ont.svim.vcf
bgzip tools/svim/hg2_ont.svim.vcf
tabix tools/svim/hg2_ont.svim.vcf.gz

Run sniffles

10a) Run sniffles:

sniffles -s 10 -l 30 -m alns/GM24385_all.bam -v tools/sniffles/hg2_ont.sniffles.vcf --genotype

10b) Prepare for truvari (keep the header, keep only DEL, INS, and DUP calls with DUP relabeled as INS, and sort by coordinate):

cat <(cat tools/sniffles/hg2_ont.sniffles.vcf | grep "^#") \
    <(cat tools/sniffles/hg2_ont.sniffles.vcf | grep -vE "^#" | \
      grep 'DUP\|INS\|DEL' | sed 's/DUP/INS/g' | sort -k1,1 -k2,2g) \
    | bgzip -c > tools/sniffles/hg2_ont.sniffles.vcf.gz
tabix tools/sniffles/hg2_ont.sniffles.vcf.gz

Run cuteSV

11a) Run cuteSV:

cuteSV alns/GM24385_all.bam tools/cuteSV/hg2_ont.cuteSV.vcf ./ -s 10 -l 30

11b) Prepare for truvari (drop inversion, duplication, and breakend calls):

grep -v 'INV\|DUP\|BND' tools/cuteSV/hg2_ont.cuteSV.vcf | bgzip -c > tools/cuteSV/hg2_ont.cuteSV.vcf.gz
tabix tools/cuteSV/hg2_ont.cuteSV.vcf.gz
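
Before running truvari, a quick sanity check that each prepared call set is non-empty (purely illustrative):

# Count non-header records in each prepared VCF.
for vcf in tools/*/hg2_ont.*.vcf.gz; do
    printf "%s\t" ${vcf}
    zcat ${vcf} | grep -vc '^#'
done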

Final comparison

  12. Compare to ground truth:
truvari -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz \
        --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-pbsv --passonly \
        --giabreport -r 1000 -p 0.00 -c tools/pbsv/hg2_ont.pbsv.vcf.gz
truvari -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz \
        --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-svim --passonly \
        --giabreport -r 1000 -p 0.00 -c tools/svim/hg2_ont.svim.vcf.gz
truvari -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz \
        --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-sniffles --passonly \
        --giabreport -r 1000 -p 0.00 -c tools/sniffles/hg2_ont.sniffles.vcf.gz
truvari -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz\
        --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-cuteSV --passonly\
        --giabreport -r 1000 -p 0.00 -c tools/cuteSV/hg2_ont.cuteSV.vcf.gz
  13. Parse results:
# Parse truvari's summary.txt into one line: F1, precision, recall, FP, FN, and FP+FN.
function sumsv() { cat $1 | grep ':' | tr -d ',' |sed "s/^[ \t]*//"| tr -d '"' |\
                   tr -d ' ' | tr ':' '\t' | awk '{ for (i=1; i<=NF; i++)  {
                     a[NR,i] = $i } } NF>p { p = NF } END { for(j=1; j<=p; j++)
                     { str=a[1,j]; for(i=2; i<=NR; i++){ str=str" "a[i,j]; } print str } }' |\
                   tail -n 1 | awk '{ printf "%1.4f\t%1.4f\t%1.4f\t%10.0f\t%10.0f\t%10.0f\n", $2,$4,$11,$1,$8,$1+$8 }';}
cat <(echo -e "Run\tF1\tPrecision\tRecall\tFP\tFN\tFP+FN")\
    <(cat <(for i in bench*; do printf $i"\t";sumsv $i/summary.txt;done) |\
    sed 's/bench//g;s/-//g' | sort -k 2 -n -r) | column -t

Or for markdown:

cat <(echo -e "Run\tF1\tPrecision\tRecall\tFP\tFN\tFP+FN")\
    <(echo -e ":-:\t:-:\t:-:\t:-:\t:-:\t:-:\t:-:")\
    <(cat <(for i in bench*; do printf $i"\t";sumsv $i/summary.txt;done) |\
    sed 's/bench//g;s/-//g' | sort -k 2 -n -r) | tr '\t' ' ' | tr -s ' ' | tr ' ' '|' | awk '{print "|"$0"|"}'

Coverage titrations

  14. Create a new base folder for each coverage and subsample the merged BAM (then index the subsampled BAMs, as shown below):
# The fractions approximate 5/47, 10/47, and 20/47 of the ~47-fold merged BAM.
samtools view -bS -s 0.106 alns/GM24385_all.bam > alns/GM24385_all_5x.bam
samtools view -bS -s 0.213 alns/GM24385_all.bam > alns/GM24385_all_10x.bam
samtools view -bS -s 0.426 alns/GM24385_all.bam > alns/GM24385_all_20x.bam
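
Most of the callers expect an indexed BAM, so index each subsampled file before rerunning them:

# Index the subsampled BAMs.
for cov in 5x 10x 20x; do
    samtools index alns/GM24385_all_${cov}.bam
done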

Repeat steps 8 through 13 (pbsv, svim, sniffles, cuteSV, and the truvari comparison) for each subsampled BAM.

For sniffles, svim, and cuteSV, I used a minimum read support of 2, 3, 4, and 10 at 5x, 10x, 20x, and 47x, respectively, as sketched below.
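
A minimal sketch of how the support setting maps onto the commands above, shown for the 10x BAM (the _10x output names are placeholders made up for this example):

# Example at 10-fold coverage: minimum read support of 3.
MIN_SUPPORT=3
# sniffles and cuteSV take the threshold directly via -s.
sniffles -s ${MIN_SUPPORT} -l 30 -m alns/GM24385_all_10x.bam -v tools/sniffles/hg2_ont_10x.sniffles.vcf --genotype
cuteSV alns/GM24385_all_10x.bam tools/cuteSV/hg2_ont_10x.cuteSV.vcf ./ -s ${MIN_SUPPORT} -l 30
# For svim, the threshold can be applied in the post-filter of step 9b:
# at 10x, keep SUPPORT >= 3 by dropping only 'SUPPORT=1;' and 'SUPPORT=2;' instead of 1 through 9.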