This project introduces vcfpp (vcf plus plus), a single C++ file as interface to the basic htslib
, which can be easily included in a C++ program
for scripting high-performance genomic analyses.
Features:
- single file to be easily included and compiled
- easy and safe API to use.
- objects are RAII. no worry about allocate and free memory.
- has the full functionalities of the
htslib
, eg. supports of compressed VCF/BCF and URL link as filename. - compatible with C++11 and later
- install htslib on your system
- download the released vcfpp.h
- put
vcfpp.h
in the same folder as your cpp source file or a folder say/my/writable/path/
or the system path
The documentation of API is here.
In this example, we count the number of heterozygous genotypes for each
sample in all records. You can paste the example code into a
example.cpp
file and compile it by g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts
.
You can replace -I.
with -I/my/writable/path/
if you put vcfpp.h
there.
#include "vcfpp.h"
using namespace std;
using namespace vcfpp;
int main(int argc, char* argv[])
{
// read data from 1000 genomes server
BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
BcfRecord var(vcf.header); // construct a variant record
vector<char> gt; // genotype can be bool, char or int type
vector<int> hetsum(vcf.nsamples, 0);
while (vcf.getNextVariant(var)) {
var.getGenotypes(gt);
if (!var.isSNP() || !var.isNoneMissing()) continue;
assert(var.ploidy()==2); // make sure it is diploidy
for(int i = 0; i < gt.size() / 2; i++)
hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]);
}
for (auto i : hetsum) { cout << i << endl; }
return 0;
}
There are 3 types used for genotypes, ie. vector<bool>
, vector<char>
and vector<int>
. One can use vector<bool>
and vector<char>
for
memory-effcient goal. The downside is that it only stores 0 and 1. And
vector<int>
can store missing values and multialleles.
If you use vector<bool>
and vector<char>
to store the genotypes, then
there is no way to represent missing values. Hence the returned
genotypes always have 0s and 1s. And genotypes with missing allele
(eg. 0/.
, ./0
, 1/.
, ./1
, ./.
) are codes as 1/0
. It’s recommended to
use var.isNoneMissing()
to check if there is missing value.
If this default behavior for vector<bool>
and vector<char>
is not what
you want, you should use vector<int>
to store the genotypes, then any
missing allele will be coded as -9
. Note you should take the missing
value -9
into account for downstream analysis.
There are many ways in vcfpp for writing the VCF/BCF file.
Here we construct an initial BCF with header using VCF4.3 specification. Next we add meta data in the header and write out variant record given a string.
BcfWriter bw("out.bcf.gz", "VCF4.3");
bw.header.addFORMAT("GT", "1", "String", "Genotype");
bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)");
bw.header.addContig("chr20"); // add chromosome
for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s); // add 3 samples
bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0");
In this example, we first read VCF file
test/test-vcf-read.vcf.gz
. Secondly, we construct an empty variant record
and update the record with the input VCF. Thirdly, we construct a
BcfWriter object using the meta data in the header of the input VCF,
writing out the header and the modified variant record.
BcfReader br("test/test-vcf-read.vcf.gz");
BcfRecord var(br.header);
br.getNextVariant(var);
BcfWriter bw("out.vcf.gz", br.header);
bw.writeHeader();
var.setPOS(100001); // update the POS of the variant
bw.writeRecord(var);
All variants related API can be found BcfRecord. The commonly used are listed below.
BcfReader vcf("bcf.gz"); // construct a vcf reader
BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header
vcf.getNextVariant(var) // get next variant
vector<char> gt; // genotype can be bool, char or int type
var.getGenotypes(gt), var.setGenotypes(gt); // get or set genotypes for current variant
var.isNoneMissing(); // check if there is missing value after getting genotypes
vector<int> gq; // genotype quality usually is of int type
var.getFORMAT("GQ",gq), var.setFORMAT("GQ",gq); // get or set a vector of genotypes quality
vector<int> pl; // Phred-scaled genotype likelihoods usually is of int type
var.getFORMAT("PL",pl); // get a vector of Phred-scaled genotype likelihoods
float af;
var.getINFO("AF", af), var.setINFO("AF", af); // get or set AF (allele frequency) value in INFO
int mq;
var.getINFO("MQ",mq) // get MQ (Average mapping quality) value from INFO
vector<int> dp4; // Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases
var.getINFO("DP4", dp4), var.setINFO("DP4", dp4); // get or set a vector of dp4 value from INFO
var.isSNP(); // check if variant is SNP
var.isSV(); // check if variant is SV
var.isIndel(); // check if variant is indel
var.isMultiAllelic(); // check if variant is MultiAllelic
var.POS(), var.setPOS(); // get POS or modify POS
All variants related API can be found in BcfHeader.
Examples of vcfpp working with R are in folder Rcpp and https://github.com/Zilong-Li/vcfppR.
Examples of vcfpp working with Python are in folder Pybind11.
Find more useful command line tools in folder tools.