knausb/vcfR

Adding INFO tags on the basis information in the INFO tags

Closed this issue · 4 comments

Hello,

I was wondering if we can add INFO tags to an object on the basis of an already existing INFO tag. For example, a vcf contains an INFO tag such as 'GENEINFO' on the basis what 'GENEINFO' equals I want to add more INFO tags such as 'Gene_aliases' or 'ExAC_pLI' mostly the information present in dbNSFP${version}_gene.gz

In your examples, I can filter a file but adding something I could not see. Do you think vcfR can perform such a thing?

Thank you.

Hello,

I'm not familiar with this sort of data. So perhaps I do not understand. But I think this can be done with paste().

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")
vcfR_test@fix[, "INFO"]
#> [1] "NS=3;DP=14;AF=0.5;DB;H2"           "NS=3;DP=11;AF=0.017"              
#> [3] "NS=2;DP=10;AF=0.333,0.667;AA=T;DB" "NS=3;DP=13;AA=T"                  
#> [5] "NS=3;DP=9;AA=G"
vcfR_test@fix[, "INFO"] <- paste(vcfR_test@fix[, "INFO"], ";Gene_aliases=", 1:5, sep = "")
vcfR_test@fix[, "INFO"]
#> [1] "NS=3;DP=14;AF=0.5;DB;H2;Gene_aliases=1"          
#> [2] "NS=3;DP=11;AF=0.017;Gene_aliases=2"              
#> [3] "NS=2;DP=10;AF=0.333,0.667;AA=T;DB;Gene_aliases=3"
#> [4] "NS=3;DP=13;AA=T;Gene_aliases=4"                  
#> [5] "NS=3;DP=9;AA=G;Gene_aliases=5"

Created on 2021-03-23 by the reprex package (v1.0.0)

Does that help?

It does. As we adding the information to the INFO column we have to update the header also. Is it done the same by using paste?

In VCF data there is a "meta" section where you can document your flags. In a vcfR object this region is a character vector. So you should add a new element for each flag.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")
class(vcfR_test@meta)
#> [1] "character"
vcfR_test@meta
#>  [1] "##fileformat=VCFv4.3"                                                                                                  
#>  [2] "##fileDate=20090805"                                                                                                   
#>  [3] "##source=myImputationProgramV3.1"                                                                                      
#>  [4] "##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta"                                                      
#>  [5] "##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>"
#>  [6] "##phasing=partial"                                                                                                     
#>  [7] "##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">"                                      
#>  [8] "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">"                                                      
#>  [9] "##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">"                                                   
#> [10] "##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral Allele\">"                                                  
#> [11] "##INFO=<ID=DB,Number=0,Type=Flag,Description=\"dbSNP membership, build 129\">"                                         
#> [12] "##INFO=<ID=H2,Number=0,Type=Flag,Description=\"HapMap2 membership\">"                                                  
#> [13] "##FILTER=<ID=q10,Description=\"Quality below 10\">"                                                                    
#> [14] "##FILTER=<ID=s50,Description=\"Less than 50% of samples have data\">"                                                  
#> [15] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"                                                        
#> [16] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"                                               
#> [17] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">"                                                     
#> [18] "##FORMAT=<ID=HQ,Number=2,Type=Integer,Description=\"Haplotype Quality\">"

Created on 2021-03-26 by the reprex package (v1.0.0)

Okay thank you!