vawk

An awk-like VCF parser

##Table of Contents

  1. Usage
  2. Examples

##Usage

usage: vawk [-h] [-v VAR] [--header] [--debug] cmd [vcf]

positional arguments:
  cmd                vawk command syntax is exactly the same as awk syntax with
                          a few additional features. The INFO field can be split using
                          the I$ prefix and the SAMPLE field can be split using
                          the S$ prefix. For example, I$AF prints the allele frequency of
                          each variant and S$NA12878 prints the entire SAMPLE field for the
                          NA12878 individual for each variant.

                          The SAMPLE field can be further split based on the keys in the
                          FORMAT field of the VCF (column 9). For example, S$NA12877$GT
                          returns the genotype of the NA12877 individual. S$* returns all samples.
                          
                          ex: '{ if (I$AF>0.5) print $1,$2,$3,I$AN,S$NA12878,S$NA12877$GT }'

  vcf                VCF file (default: stdin)

optional arguments:
  -h, --help         show this help message and exit
  -v VAR, --var VAR  declare an external variable (e.g.: SIZE=10000)
  --header           print VCF header
  --debug            debugging level verbosity
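
The I$ and S$ prefixes described above behave like keyed lookups into the INFO column (8), the FORMAT column (9), and the sample columns (10 onward) of a VCF record. The Python sketch below illustrates that idea on a single made-up record. The helper names `split_info` and `split_sample` are hypothetical, and this is not necessarily how vawk implements the prefixes.

```python
def split_info(info):
    """Turn 'AF=0.6;AN=4;IMPRECISE' into a dict; bare flags map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def split_sample(format_col, sample_col):
    """Pair the FORMAT keys (column 9) with one sample column's values."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


# Hypothetical column header and record with two samples.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\tNA12877"
record = "1\t755446\t1\tT\t<DEL>\t10\t.\tAF=0.6;AN=4\tGT:SUP\t0/1:16\t1/1:9"

names = header.split("\t")
cols = record.split("\t")
info = split_info(cols[7])
samples = {
    name: split_sample(cols[8], col)
    for name, col in zip(names[9:], cols[9:])
}

print(info["AF"])                 # roughly what I$AF refers to
print(cols[9])                    # roughly what S$NA12878 refers to
print(samples["NA12877"]["GT"])   # roughly what S$NA12877$GT refers to
```

Values stay as strings here; awk converts them to numbers on comparison, whereas Python would need an explicit `float()` or `int()`.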

##Examples

Print the variant genotypes for sample CHM1.

$ vawk '{ print S$CHM1$GT }' CHM1.lumpy.vcf

If the genotype is 0/1 in CHM1, print a subset of the fields.

$ vawk -v MYGT="0/1" '{ if (S$CHM1$GT==MYGT) print $1,$2,$3,$4,$5,S$CHM1$SUP }' CHM1.lumpy.vcf  | head
# 1    755446	   1  T	 <DEL>		   16
# 1    839915	   2  C	 <DEL>		   10
# 1    900049	   3  C	 <DEL>		   9
# 1    965024	   4  T	 <DEL>		   5
# 1    2038157	   5  A	 <DEL>		   5
# 1    2143123	   6  T	 <DUP>		   9
# 1    2367925	   7  C	 <DEL>		   4
# 1    2765775	   8  G	 <DEL>		   4
# 1    3027000	   9  G	 <DEL>		   4
# 1    3092611	   10 T	 <DUP>		   11

If the genotype of sample NA12878 is 0/1 and QUAL > 6, print the genotype of sample NA12891 and the supporting evidence type (I$EVTYPE) for the variant.

$ zcat ceph1463.lumpy.vcf.gz | vawk '{if (S$NA12878$GT=="0/1" && $6>6) print S$NA12891$GT, I$EVTYPE}'  | head
# 0/1   PE_ALONE
# 0/0   PE_ALONE
# 0/1   PE_ALONE
# 0/1   PE_ALONE
# 0/0   PE_ALONE
# 0/1   SR_ALONE
# 0/0   PE_ALONE
# 0/1   PE_SR
# 0/0   PE_ALONE
# 0/1   PE_ALONE

If the genotype of sample NA12878 is 0/1, print the entire line. Also, include the VCF header in the output.

$ zcat ceph1463.lumpy.vcf.gz | vawk --header 'S$NA12878$GT=="0/1"'
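
As a rough illustration of what this kind of filter involves, the Python sketch below passes header lines (those starting with `#`) through when asked and keeps only records matching a predicate. The function `filter_vcf`, the predicate `is_het`, and the input lines are hypothetical; the sketch assumes GT is the first FORMAT key and NA12878 is the first sample column, and it does not claim to mirror vawk's internals.

```python
def filter_vcf(lines, keep, header=False):
    """Yield records for which keep(cols) is true; optionally echo '#' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            if header:
                yield line
            continue
        if keep(line.split("\t")):
            yield line


# Made-up input: one meta line, the column header, and two records.
vcf_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878",
    "1\t100\t1\tA\t<DEL>\t8\t.\tSVTYPE=DEL\tGT\t0/1",
    "1\t200\t2\tC\t<DUP>\t9\t.\tSVTYPE=DUP\tGT\t0/0",
]


def is_het(cols):
    # Assumption: GT is the first FORMAT key and NA12878 is column 10.
    return cols[9].split(":")[0] == "0/1"


for line in filter_vcf(vcf_lines, is_het, header=True):
    print(line)
```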

If the variant has split-read support greater than 3, print selected columns and INFO fields, including the span computed as I$END-$2.

$ zcat ceph1463.lumpy.vcf.gz | vawk '{ if (I$SRSUP>3) print $1,$4,$5,$2,I$END, I$END-$2, I$SVLEN, I$EVTYPE, I$PESUP, I$SRSUP} ' | head
# 1 A   <DEL>   988595  988622  27  -27 PE_SR   1   9
# 1 C   <DEL>   991860  991981  121 -121    SR_ALONE    0   21
# 1 G   <DEL>   1142719 1143139 420 -420    PE_SR   1   12
# 1 C   <DEL>   1362765 1362821 56  -56 SR_ALONE    0   6
# 1 A   <DEL>   1530383 1530685 302 -302    PE_SR   2   86
# 1 T   <DUP>   1880413 1880554 141 141 SR_ALONE    0   8
# 1 G   <DUP>   1912966 1913504 538 538 SR_ALONE    0   4
# 1 T   <DEL>   2054429 2055813 1384    -1384   SR_ALONE    0   40
# 1 C   <DUP>   2212237 2213976 1739    1739    SR_ALONE    0   9
# 1 G   <DEL>   2837835 2837901 66  -66 SR_ALONE    0   113
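
The I$END-$2 expression above mixes an INFO value with a fixed column. A minimal Python sketch of the same arithmetic follows; the record is made up so that its values match the first output row shown above (the ID and QUAL values are invented), and it assumes every INFO entry has the form `key=value`.

```python
# Hypothetical record; POS, END, SVLEN, and support counts mirror the first row above.
record = (
    "1\t988595\t5\tA\t<DEL>\t12\t.\t"
    "SVTYPE=DEL;SVLEN=-27;END=988622;EVTYPE=PE_SR;PESUP=1;SRSUP=9\tGT\t0/1"
)

cols = record.split("\t")
# Assumption: no bare flags in INFO, so every item splits into key and value.
info = dict(item.split("=", 1) for item in cols[7].split(";"))

if int(info["SRSUP"]) > 3:
    # awk coerces strings to numbers implicitly; Python needs int().
    span = int(info["END"]) - int(cols[1])
    print(cols[0], cols[3], cols[4], cols[1], info["END"], span,
          info["SVLEN"], info["EVTYPE"], info["PESUP"], info["SRSUP"],
          sep="\t")
```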

Print the genotype field for all samples using the '*' wildcard.

$ zcat mult_gt_rd.sv.vcf.gz | vawk '{ print S$*$GT }' | head
# 0/0  0/0
# 0/1  1/1
# 0/0  0/0
# 0/0  0/1
# 0/0  0/1
# 0/1  0/0
# 0/0  0/0
# 0/0  0/0
# 0/0  0/0
# 0/0  0/1
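
To show what a lookup across all samples involves, the Python sketch below finds the position of a key in the FORMAT column and pulls that position from every sample column. The function name `all_sample_values` and the record are hypothetical, and the sketch assumes no sample column drops trailing subfields.

```python
def all_sample_values(cols, key):
    """Return the value of one FORMAT key for every sample column."""
    idx = cols[8].split(":").index(key)
    return [sample.split(":")[idx] for sample in cols[9:]]


# Made-up record with two samples.
record = "1\t100\t1\tA\t<DEL>\t8\t.\tSVTYPE=DEL\tGT:SUP\t0/1:7\t1/1:12"
cols = record.split("\t")

print("\t".join(all_sample_values(cols, "GT")))
```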