KOIOS is an open source tool developed and supported by the OHDSI Oncology WG that allows users to combine their variant data with the OMOP Genomic Vocabulary in order to generate a set of genomic standard concept IDs from raw patient-level genomic data.
KOIOS can presently be installed directly from GitHub:
# install.packages("devtools")
devtools::install_github("odyOSG/KOIOS")
The file userScript.R provides a default workflow in which only the reference genome and the VCF file (or directory of VCF files) need to be specified.
Users must provide at least one valid VCF file in either .vcf or .vcf.gz format. This may be in the form of a single file, or a directory containing a set of .vcf or .vcf.gz files.
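As a minimal sketch of the input requirement above, the following base R snippet checks whether a path is a single .vcf/.vcf.gz file or a directory containing such files. The helper `findVcfInputs` is hypothetical and not part of KOIOS; it only illustrates a pre-check that could be run before calling loadVCF.

```r
# Hypothetical helper (not part of KOIOS): list the VCF inputs behind a path.
findVcfInputs <- function(path) {
  vcfPattern <- "\\.vcf(\\.gz)?$"
  if (dir.exists(path)) {
    # Directory: keep only files ending in .vcf or .vcf.gz
    files <- list.files(path, pattern = vcfPattern, full.names = TRUE)
  } else if (file.exists(path) && grepl(vcfPattern, path)) {
    # Single file with a valid extension
    files <- path
  } else {
    files <- character(0)
  }
  if (length(files) == 0) {
    stop("No .vcf or .vcf.gz files found at: ", path)
  }
  files
}

# Usage with a throwaway directory of empty placeholder files
demoDir <- file.path(tempdir(), "koios_vcf_demo")
dir.create(demoDir, showWarnings = FALSE)
invisible(file.create(file.path(demoDir, c("a.vcf", "b.vcf.gz", "notes.txt"))))
basename(findVcfInputs(demoDir))
```

The check is purely name-based; it does not confirm that the files are well-formed VCF.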
KOIOS can be run with the following pipeline:
library(KOIOS)
#Load the OMOP Genomic Vocabulary into R
concepts <- loadConcepts()
#Specify input file or directory
vcf <- loadVCF(userVCF = "Input.vcf")
#Specify and load human reference genome, if known
ref <- "hg19"
ref.df <- loadReference(ref)
#Process VCF and generate all relevant HGVSG identifiers for input records
vcf.df <- processVCF(vcf)
vcf.df <- generateHGVSG(vcf = vcf.df, ref = ref.df)
vcf.df <- processClinGen(vcf.df, ref = ref, progressBar = FALSE)
#Combine this output data with the OMOP Genomic vocab to produce a DF containing a list of concept codes
vcf.df <- addConcepts(vcf.df, concepts, returnAll = TRUE)
If the reference genome used to generate a given VCF file is unknown, users may run the following commands, which check the VCF variants against known ClinGen variants.
vcf <- loadVCF(userVCF = "Input VCF")
ref <- "auto"
ref <- findReference(vcf)
ref.df <- loadReference(ref)
Multiple VCF files within a single directory may be submitted simultaneously within a single command:
#Load the VCF directory
vcf <- loadVCF(userVCF = "SomeDirectory/")
#Set ref to hg19
ref <- "hg19"
concepts.df <- multiVCFPipeline(vcf, ref, generateTranscripts, concepts)
While it is possible to use the automatic reference finder for multiple files, it is not recommended due to the long runtime.
It is also possible to run KOIOS on VCF-like data formats, with examples detailed below. An appropriate reference is required, as with VCF data.
mutations <- read.csv("data_mutations.txt", sep = "\t")
#reference information is likely stored in mutations$NCBI_Build
mut_vcf <- processcBioPortal(mutations)
mut_vcf <- processClinGen(mut_vcf, ref = ref, progressBar = FALSE)
mut_vcf <- addConcepts(mut_vcf, concepts)
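The snippet above uses `ref` without setting it. As a hedged sketch, the value could be derived from the NCBI_Build column mentioned in the comment. The helper `buildToRef` is hypothetical and not part of KOIOS, and the assumption that "hg38" is an accepted reference label alongside "hg19" should be checked against the package documentation.

```r
# Hypothetical helper (not part of KOIOS): map MAF NCBI_Build values to a ref label.
buildToRef <- function(builds) {
  builds <- unique(as.character(builds[!is.na(builds)]))
  # Refuse to guess when the file is empty or mixes several builds
  if (length(builds) != 1) {
    stop("Expected exactly one genome build, found: ", paste(builds, collapse = ", "))
  }
  # Assumed mapping; "hg38" as a valid KOIOS label is an assumption
  lookup <- c("GRCh37" = "hg19", "37" = "hg19", "GRCh38" = "hg38", "38" = "hg38")
  if (!builds %in% names(lookup)) {
    stop("Unrecognised genome build: ", builds)
  }
  unname(lookup[builds])
}

# Usage on a toy stand-in for the cBioPortal mutations table
mutations <- data.frame(NCBI_Build = c("GRCh37", "GRCh37"), stringsAsFactors = FALSE)
ref <- buildToRef(mutations$NCBI_Build)
```

Stopping on mixed or unknown builds is deliberate: matching variants against the wrong reference would silently produce incorrect HGVSg identifiers.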
HGVSg data can be directly read into KOIOS and submitted via the processClinGen function. A minimal HGVSg dataframe input requires a column named “hgvsg”.
hgvsg <- read.csv("hgvsg.csv", sep = "\t")
hgvsg <- processClinGen(hgvsg, ref = ref)
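To make the minimal input concrete, the sketch below builds a one-column data frame named "hgvsg", writes it as a tab-separated file, and reads it back the same way as above. The two HGVSg strings are illustrative values on GRCh37 (hg19) coordinates, and the temporary file name is arbitrary; no KOIOS functions are called.

```r
# Illustrative HGVSg strings on GRCh37/hg19 coordinates (example values only)
hgvsg <- data.frame(
  hgvsg = c("NC_000007.13:g.140453136A>T", "NC_000007.13:g.55259515T>G"),
  stringsAsFactors = FALSE
)

# Round-trip through a tab-separated file, matching the read.csv call above
hgvsgFile <- tempfile(fileext = ".tsv")
write.table(hgvsg, hgvsgFile, sep = "\t", quote = FALSE, row.names = FALSE)
hgvsgIn <- read.csv(hgvsgFile, sep = "\t", stringsAsFactors = FALSE)

# The required column must be present before the data are submitted
stopifnot("hgvsg" %in% colnames(hgvsgIn))
```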
Data already formatted as transcript (HGVSc) or protein (HGVSp) identifiers, such as cBioPortal input data (as below), may also be submitted to KOIOS.
These data are simply matched directly with the extended concepts object, derived from the OMOP Genomic vocabulary.
transcript_data <- read.csv("data_transcripts.txt", sep = "\t")
transcript_merge <- merge(transcript_data, concepts_ext, by.x = "hgvsc", by.y = "concept_synonym_name")
#The following is an optional step to remove version information from input transcript HGVSc.
#This allows for a wide range of older data to be submitted to the vocabulary, but has a small chance of generating false positive matches.
#transcript_data$match_hgvs <- gsub("\\.[0-9]*:", ":", transcript_data$HGVSc)
#concepts_ext$match_hgvs <- gsub("\\.[0-9]*:", ":", concepts_ext$concept_synonym_name)
#transcript_merge <- merge(transcript_data, concepts_ext, by = "match_hgvs")
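The direct match and the optional version-stripping step can be illustrated on toy data. In this sketch both data frames are stand-ins: the transcript versions are chosen for illustration, the concept_id values are placeholders rather than real OMOP concept IDs, and `stripVersion` is a hypothetical helper. It also shows why the dot in the pattern should be escaped.

```r
# Toy stand-ins; concept_id values are placeholders, not real OMOP concept IDs
transcript_data <- data.frame(
  hgvsc = c("NM_004333.4:c.1799T>A", "NM_005228.3:c.2573T>G"),
  stringsAsFactors = FALSE
)
concepts_ext <- data.frame(
  concept_synonym_name = c("NM_004333.6:c.1799T>A", "NM_005228.3:c.2573T>G"),
  concept_id = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Exact match: only the record with an identical transcript version is kept
exact <- merge(transcript_data, concepts_ext,
               by.x = "hgvsc", by.y = "concept_synonym_name")

# Hypothetical helper: drop the ".<version>" that precedes the colon
stripVersion <- function(x) gsub("\\.[0-9]+:", ":", x)
transcript_data$match_hgvs <- stripVersion(transcript_data$hgvsc)
concepts_ext$match_hgvs <- stripVersion(concepts_ext$concept_synonym_name)

# Relaxed match: both records now match, regardless of version
relaxed <- merge(transcript_data, concepts_ext, by = "match_hgvs")

# An unescaped dot matches any character and damages unversioned identifiers
damaged <- gsub(".[0-9]*:", ":", "NM_004333:c.1799T>A")
```

Here `exact` has one row and `relaxed` has two, while `damaged` becomes "NM:c.1799T>A", which is why the escaped pattern is the safer choice.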
KOIOS may also be used to match gene fusion data with the relevant concept_ids, such as cBioPortal gene fusion data (as below).
concepts_fusion <- loadConcepts_fusions()
fusions_data <- read.csv("data_sv.txt", sep = "\t")
fusions_data <- generateFusions_cBioPortal(fusions_data,concepts_fusion)
If you encounter a clear bug, please file an issue with a minimal reproducible example at the GitHub issues page.