A python parser to simplify the vcf file into a (pseudo-)tabular format. There are several tools available to mainpulate and alter VCF files. But, a simple but expansive tool that can produce most simple output is required by empirical biologist is still amiss.
-
Convert VCF to TABLE
This tool takes in a sorted vcf file and reports a simplified tabular output forINFO
andFORMAT
field for eachSAMPLE
of interest. With default state (minimal code) all theINFO
,FORMAT
for all theSAMPLE
are simplified. Fields can be further narrowed down using very convenient and comprehensive scripts. -
Convert TABLE to VCF
It is also possible to convert the TABLE file into VCF. Controlled, workflows are included.
Exclusively for phase-Stitcher and phase-Extender. Controlled workflows are included.
- Convert VCF to Haplotype\
- Convert Haplotype to VCF\
Required Python packages and modules
$ python3 VCF-Simplify.py -h Checking required modules VCF-Simplify: A tool to convert VCF to TABLE and/or HAPLOTYPE file and vice-versa. usage: VCF-Simplify [-h] {SimplifyVCF,BuildVCF} ... positional arguments: {SimplifyVCF,BuildVCF} Choose one of the following method. SimplifyVCF Simplify VCF : to a haplotype or a table file. BuildVCF Create VCF : from a haplotype or a table file. optional arguments: -h, --help show this help message and exit
You can see that there are two options available for conversion:
- SimplifyVCF : to convert from VCF ----> to TABLE and/or Haplotype file
- BuildVCF : to convert from TABLE and/or Haplotype ---> to VCF
$ python3 VCF-Simplify.py SimplifyVCF -h Checking required modules VCF-Simplify: A tool to convert VCF to TABLE and/or HAPLOTYPE file and vice-versa. usage: VCF-Simplify SimplifyVCF [-h] -toType TOTYPE -inVCF INVCF -out OUT [-keepHeader KEEPHEADER] [-PG PG] [-PI PI] [-unphased UNPHASED] [-samples SAMPLES] [-preHeader PREHEADER] [-infos INFOS] [-formats FORMATS] [-mode MODE] [-gtbase GTBASE] optional arguments: -h, --help show this help message and exit -toType TOTYPE Type of the output file. Option: haplotype, table -inVCF INVCF sorted vcf file -out OUT name of the output file -keepHeader KEEPHEADER Write the HEADER data to a separate output file.Options: 'yes' or 'no' Additional flags for "VCF To Haplotype": -PG PG FORMAT tag containing the phased genotype of the SAMPLE. Only applicable for 'haplotype file output'. -PI PI FORMAT tag representing the unique index of RBphased haplotype block in the SAMPLE. Only applicable for 'haplotype file output'. Note: 'CHROM' can also be used as PI if VCF is phased chromosome-wide. -unphased UNPHASED include unphased variants in the output. Available options: yes, no Additional flags for "VCF To Table": -samples SAMPLES SAMPLE of interest; write as comma separated names, for e.g: 'sampleA,sampleB' or 'all'. -preHeader PREHEADER Comma separated pre-header fields before the 'INFO' field in the input VCF file. Write as comma separated fields, for e.g: 'CHR,POS,ID' or 'all'. Default: 'all'. -infos INFOS INFO tags that are of interest; write as comma separated tags; for e.g: 'AC,AF,AN' or 'all'. -formats FORMATS FORMAT tags that are of interest; for e.g: 'GT,PG,PI' or 'all'. -mode MODE Structure of the output table.Options: wide(0), long(1). Default: 0 . -gtbase GTBASE write the GT field as IUPAC base code.Options: no(0), yes(1). Default: 0 .
python3 VCF-Simplify.py SimplifyVCF -toType table -inVCF input_test.vcf -out simple_table.txt -infos AF,AN,BaseQRankSum,ClippingRankSum -formats PI,GT,PG -preHeader CHROM,POS,REF,ALT,FILTER -mode wide -samples MA605,ms01e -gtbase yes
- Converts the "input_test.vcf" to a TABLE file.
- Uses samples MA605 and ms01e
- Uses FORMAT tags: PI,GT and PG
- Uses INFO tags: AF,AN,BaseQRankSum,ClippingRankSum
- Outputs TABLE in "wide" layout
- Outpts GT field as IUPAC base
Output in Wide format
CHROM POS REF ALT FILTER AF AN BaseQRankSum ClippingRankSum MA605:PI MA605:GT MA605:PG ms01e:PI ms01e:GT ms01e:PG CHROM POS REF ALT FILTER AF AN BaseQRankSum ClippingRankSum MA605:PI MA605:GT MA605:PG MA611:PI MA611:GT MA611:PG MA622:PI MA622:GT MA622:PG 2 15881018 G A,C PASS 1.0 8 -0.771 0.0 . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881080 A G PASS 0.458 6 -0.732 0.0 . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881106 C CA PASS 0.042 6 0.253 0.0 . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881156 A G PASS 0.5 6 None None . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881224 T G PASS 0.036 12 1.75 0.0 . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881229 C G PASS 0.308 10 None None . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881230 GTT G,GTTT PASS 0.346,0.038 10 0.0 0.0 . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881246 A AT PASS 0.333 8 None None . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881256 A T PASS 0.333 8 None None . 0/0 0/0 . 0/0 0/0 . 0/0 0/0 2 15881266 T G PASS 0.286 12 None None . 0/0 0/0 . 0/0 0/0 . 0/0 0/0
Output in long format:
python3 VCF-Simplify.py SimplifyVCF -toType table -inVCF input_test.vcf -out simple_table.txt -infos AF,AN,BaseQRankSum,ClippingRankSum -formats PI,GT,PG -mode long -samples MA605,MA611,MA622
CHROM POS ID REF ALT QUAL FILTER AF AN BaseQRankSum ClippingRankSum SAMPLE PI GT PG 2 15881018 . G A,C 5082.45 PASS 1.0 8 -0.771 0.0 MA605 . 0/0 0/0 2 15881018 . G A,C 5082.45 PASS 1.0 8 -0.771 0.0 MA611 . 0/0 0/0 2 15881018 . G A,C 5082.45 PASS 1.0 8 -0.771 0.0 MA622 . 0/0 0/0 2 15881080 . A G 4336.44 PASS 0.458 6 -0.732 0.0 MA605 . 0/0 0/0 2 15881080 . A G 4336.44 PASS 0.458 6 -0.732 0.0 MA611 . 0/0 0/0 2 15881080 . A G 4336.44 PASS 0.458 6 -0.732 0.0 MA622 . 0/0 0/0 2 15881106 . C CA 33.32 PASS 0.042 6 0.253 0.0 MA605 . 0/0 0/0 2 15881106 . C CA 33.32 PASS 0.042 6 0.253 0.0 MA611 . 0/0 0/0 2 15881106 . C CA 33.32 PASS 0.042 6 0.253 0.0 MA622 . 0/0 0/0 2 15881156 . A G 3595.22 PASS 0.5 6 None None MA605 . 0/0 0/0 2 15881156 . A G 3595.22 PASS 0.5 6 None None MA611 . 0/0 0/0 2 15881156 . A G 3595.22 PASS 0.5 6 None None MA622 . 0/0 0/0
Use minimal script for full term simplification:
*Output not shown
python3 VCF-Simplify.py SimplifyVCF -to table -inVCF input_test.vcf -out simple_table.txt -keepHeader yes
- Simplified data for all the infos, formats for all the sample.
- Will output in wide format by default.
Include "-keepHeader " to store meta header of the VCF as separate file.
*Output not shown
python3 VCF-Simplify.py SimplifyVCF -to table -inVCF input_test.vcf -out simple_table.txt -keepHeader yes
- Converts a VCF to a HAPLOTYPE file.
- The HAPLOTYPE file can be used downstream with tools:
- phase-Extender
- phase-Stitcher
- "PG" flag is used to indicate "phased-genotype" field in SAMPLE
- "PI" flag is used to indicate "haplotype-block" index
- Other FORMAT fields can be used with PG and PI
- "CHROM" field can be included with PI flag, assuming the VCF is phased chromosome wide.
$ python3 VCF-Simplify.py SimplifyVCF -toType haplotype -inVCF input_test.vcf -out simple_haplotype.txt
CHROM POS all-alleles ms01e:PI ms01e:PG_al ms02g:PI ms02g:PG_al ms03g:PI ms03g:PG_al ms04h:PI ms04h:PG_al MA611:PI MA611:PG_al MA605:PI MA605:PG_al MA622:PI MA622:PG_al 2 15881551 A,T . . . . . . 9 T|A . . . . . . 2 15881553 C,A 4 C|A . . . . 9 C|C . . . . . . 2 15881764 T,C 4 C|T 6 C|T . . 9 T|T . . . . . . 2 15881767 C,T 4 C|C 6 T|C . . 9 C|C . . . . . . 2 15881810 A,C 4 C|C 6 C|C . . 9 C|C . . . . . . 2 15881944 C,T 4 T|C 6 C|C 7 C|T 7 C|T . . . . . . 2 15881974 C,A 4 A|C 6 C|C 7 C|A 7 C|A . . . . . . 2 15881989 C,A 4 C|C 6 A|C 7 C|C 7 C|C . . . . . . 2 15882091 A,T 4 A|T 6 A|T 7 T|A 7 A|A . . . . . . 2 15882148 T,G 4 T|T 6 T|T 7 T|T 7 T|G . . . . . . 2 15882328 T,A 4 A|T 6 T|T 7 T|T 7 T|T . . . . . . 2 15882364 T,G 4 G|T 6 T|T 7 T|T 7 T|T . . . . . . 2 15882451 T,C 4 C|T 4 C|T 7 T|T 7 T|T . . . . . . 2 15882454 T,C 4 C|T 4 C|T 7 T|T 7 T|T . . . . . . 2 15882493 T,C 4 C|T 4 C|T 7 T|T 7 T|T . . . . . . 2 15882505 T,A 4 A|T 4 A|T 7 T|T 7 T|T . . . . . . 2 15882583 G,T 4 G|G 4 G|G 7 G|G 5 G|T . . . . . . 2 15882592 G,A 4 A|G 4 A|G 6 G|A 5 G|A . . . . . .
If your "GT" tag contains the phased genotype you can
python3 VCF-Simplify.py SimplifyVCF -to haplotype -inVCF input_test.vcf -out simple_table.txt -PG GT -PI PI
If you want "CHROM" field as "-PI" index
python3 VCF-Simplify.py SimplifyVCF -to haplotype -inVCF input_test.vcf -out simple_table.txt -PG GT -PI CHROM
Additionally to output unphased genotypes and store the header of the VCF
python3 VCF-Simplify.py SimplifyVCF -to haplotype -inVCF input_test.vcf -out simple_table.txt -PG GT -PI PI -unphased yes -keepHeader yes
$ python3 VCF-Simplify.py BuildVCF -h Checking required modules VCF-Simplify: A tool to convert VCF to TABLE and/or HAPLOTYPE file and vice-versa. usage: VCF-Simplify BuildVCF [-h] -fromType FROMTYPE -inFile INFILE -outVCF OUTVCF -vcfHeader VCFHEADER [-GTbase GTBASE] [-samples SAMPLES] [-formats FORMATS] [-infos INFOS] optional arguments: -h, --help show this help message and exit -fromType FROMTYPE Type of the input file to prepare the VCF from. Options: haplotype, table -inFile INFILE Sorted table or haplotype file. This haplotype file can be obtained from phase-Stitcher or phase- Extender. The table file should be in the format output by 'VCF-Simplify'; only long format table is supported for now. -outVCF OUTVCF Name of the output VCF file. -vcfHeader VCFHEADER A custom VCF header to add to the VCF file. The VCF header should not contain the line with #CHROM .... Additional flags for "Table To VCF": -GTbase GTBASE Representation of the GT base is : numeric (0), IUPAC (1) -samples SAMPLES Name of the samples -> comma separated name of the samples that needs to be converted to VCF format -formats FORMATS Name of the FORMAT tags to write -> comma separated FORMAT tags name. -infos INFOS Name of the INFO tags to write -> comma separated INFO tags name.
*Note:
- requires "VCF header" from other VCF or custom VCF, with no
#CHROM
line. - the type of the data in "GT" should be indicated.
python3 VCF-Simplify.py BuildVCF -fromType table -inFile simple_table.txt -vcfHeader vcf_header.txt -outVCF table_toVCF.vcf -GTbase numeric
*Note: This run with a minimal script.
python3 VCF-Simplify.py BuildVCF -fromType haplotype -inFile simple_haplotype.txt -vcfHeader vcf_header.txt -outVCF haplotype_toVCF.vcf
- Ability to add
genotype bases
for fields other than "GT" . - Ability to handle symbolic alleles.
- Ability to :
- prepare custom diploid genome.
- prepare custom GTF, GFF files.
- Extract gene sequence using ref genome and VCF files for phylogenetic analyses.
Giri, B.K, (2018). VCF-simplify: Tool to build and simplify VCF (variant call format) files.