OntologyEval is an approach for assessing how similar samples are to each other. It can be used, for example, to compare batch effect correction or normalisation methods on heterogeneous datasets. It uses a cell type ontology to evaluate whether datasets derived from different cell types are as similar to one another as the ontology would suggest.
This repository includes:
- R code to compute an ontology-based score for user-defined input.
- Similarity measures based on cosine similarity and the Jaccard coefficient, inferred from the Cell Ontology [1].
- Example data to run OntologyEval.
- R scripts used to generate the figures in the OntologyEval manuscript.
Please note that, due to size constraints, we cannot provide the original gene expression data used in the manuscript. Please contact us via e-mail if you are interested in it. The reprocessed IHEC data sets are available via the DEEP-Blue webserver.
In the OntologyEval/Data folder, we provide the following files related to the Cell Ontology:
- cosinesim.tsv
- jaccardsim.tsv
- idx2termid.tsv
The files cosinesim.tsv and jaccardsim.tsv contain similarity scores derived from the Cell Ontology, computed using cosine similarity and the Jaccard index, respectively. As shown by the head command,
head cosinesim.tsv
0 0 0.9999999999999998
0 1 0.7844645405527361
0 2 0.8944271909999159
0 3 0.5897678246195885
0 4 0.5252257314388902
these are tab-separated files with three columns and no header. The first two columns are tissue indices and the third column is the similarity score between the two tissues. The files contain all tissues available in the Cell Ontology.
The file idx2termid.tsv maps the tissue indices used in the similarity files to Cell Ontology IDs:
head idx2termid.tsv
0 CL:0000000
1 CL:0000001
2 CL:0000003
3 CL:0000005
4 CL:0000006
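To illustrate how the two kinds of files fit together, here is a minimal sketch that looks up the similarity between two Cell Ontology terms. The helper name `lookup_sim` and the temporary stand-in files are assumptions made for illustration; the stand-ins only copy a few rows from the head outputs above. The sketch also checks both index orders, since it is an assumption whether each pair is stored once or twice.

```shell
# Tiny stand-ins for Data/idx2termid.tsv and Data/cosinesim.tsv
# (tab separated, no header), copied from the head outputs above.
tmp=$(mktemp -d)
printf '0\tCL:0000000\n1\tCL:0000001\n2\tCL:0000003\n' > "$tmp/idx2termid.tsv"
printf '0\t0\t0.9999999999999998\n0\t1\t0.7844645405527361\n0\t2\t0.8944271909999159\n' > "$tmp/cosinesim.tsv"

# lookup_sim TERM_A TERM_B IDX_FILE SIM_FILE   (hypothetical helper)
# While reading the first file, map term IDs to indices.
# While reading the second file, print the score of the requested pair.
lookup_sim() {
  awk -F'\t' -v a="$1" -v b="$2" '
    NR == FNR { idx[$2] = $1; next }
    ($1 == idx[a] && $2 == idx[b]) || ($1 == idx[b] && $2 == idx[a]) { print $3; exit }
  ' "$3" "$4"
}

sim=$(lookup_sim "CL:0000000" "CL:0000003" "$tmp/idx2termid.tsv" "$tmp/cosinesim.tsv")
echo "$sim"
```

With the real files, the last two arguments would point to Data/idx2termid.tsv and either Data/cosinesim.tsv or Data/jaccardsim.tsv.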
OntologyEval requires the user to provide:
- an ontology file, e.g. OntologyEval/Data/cosinesim.tsv. The file OntologyEval/Data/cosinesim.tsv is used by default.
- a mapping from indices to ontology terms, e.g. OntologyEval/Data/idx2termid.tsv. The provided file is loaded by default.
- a mapping from sample IDs to ontology terms. An example is shown in OntologyEval/Data/Example_Terms:
head Example_Terms
SRR659649_Liver CL:0000182
SRR807971_Liver CL:0000182
SRR807995_Liver CL:0000182
SRR815140_Liver CL:0000182
This file must be created by the user.
- a matrix with observed/measured values, e.g. quantified gene expression data. The file holds sample IDs in the columns and gene IDs in the rows. An example is included in OntologyEval/Data/ExampleData.rds. Besides rds files, txt files can also be used.
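Since the sample-to-term mapping is written by hand, it may be worth checking it before running the tool. The sketch below lists every sample whose term does not occur in the index-to-term mapping. It assumes both files have two whitespace-separated columns without a header; the file name my_terms.txt, the sample SampleX_Unknown, the term CL:9999999 and the indices in the stand-in mapping are made up for illustration.

```shell
# Stand-in files; indices and the second sample are invented for this example.
tmp=$(mktemp -d)
printf '0\tCL:0000000\n1\tCL:0000182\n' > "$tmp/idx2termid.tsv"
printf 'SRR659649_Liver\tCL:0000182\nSampleX_Unknown\tCL:9999999\n' > "$tmp/my_terms.txt"

# First file: remember every known term ID (column 2).
# Second file: print sample ID and term for each term that is not known.
missing=$(awk '
  NR == FNR { known[$2] = 1; next }
  !($2 in known) { print $1 "\t" $2 }
' "$tmp/idx2termid.tsv" "$tmp/my_terms.txt")

if [ -n "$missing" ]; then
  echo "Samples with terms missing from the ontology mapping:"
  echo "$missing"
fi
```

An empty result means that every term in the mapping can be resolved to an index.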
The tool generates:
- a txt file holding the ontology scores for each sample and the respective Cell Ontology term
- optionally, a boxplot depicting the ontology score for each Cell Ontology term.
The following arguments can be passed to OntologyEval:
- --Ontology: The ontology file to be used. Default is Data/cosinesim.tsv.
- --Idx2Term: The index to ontology term mapping file. Default is Data/idx2termid.tsv.
- --Sample2Term: The sample ID to ontology term mapping file. Default is the example file Data/Example_Terms.txt.
- --ObservedScoreMatrx: The matrix holding observed data, either in rds or txt format. Default is the example file Data/ExampleData.rds.
- --ObservedSimMethod: The method to assess similarity across the PCs on the observed data. Can be any of pearson, spearman, kendall. Default is spearman.
- --OntologySimMethod: The method to assess similarity across the distance vectors. Can be any of pearson, spearman, kendall. Default is spearman.
- --Output: Name of the output file holding the ontology scores. Default is Ontology_Score_Output.txt.
- --fontsize: Font size used in the boxplot, if one is generated. Default is 20.
- --Log2: TRUE (default) if observed data should be logarithmized, FALSE otherwise.
- --Center: TRUE (default) if observed data should be centered at 0, FALSE otherwise.
- --Scale: TRUE (default) if observed data should be scaled between 0 and 1, FALSE otherwise.
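- --nPCA: Number of principal components used to compute the distance on the observed data. Default is 4.
- --pngFile: Optional. If provided, a boxplot visualizing the scores across the CL terms is written to this file, e.g. Example.png.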
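As a sketch of how several of these arguments might be combined, the snippet below assembles a call on user-supplied data and prints it for inspection. The flag names are taken from the list above as written there; the `--flag=value` form is assumed to work for all of them because it is the form used with pngFile elsewhere in this README. The input and output file names are hypothetical.

```shell
# Hypothetical file names; replace them with your own files.
SAMPLE2TERM="my_terms.txt"
MATRIX="my_expression.rds"

# Store the call as positional parameters so that each argument stays one word.
set -- Rscript computeOntologyScore.R \
  --Sample2Term="$SAMPLE2TERM" \
  --ObservedScoreMatrx="$MATRIX" \
  --ObservedSimMethod=pearson \
  --nPCA=6 \
  --Output=my_scores.txt

# Print the assembled command; to actually run it, execute: "$@"
cmd_line="$*"
echo "$cmd_line"
```

Arguments that are left out keep the defaults listed above.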
To run the example included in the repository, make sure you are in the main repository folder:
cd OntologyEval
There, you will find the R script
computeOntologyScore.R
Without providing any additional arguments, the example can be run with the command
Rscript computeOntologyScore.R
This will generate the output file Ontology_Score_Output.txt, which contains the ontology score and the CL term ID for each sample:
head Ontology_Score_Output.txt
Sample Score TermID
SRR659649_Liver 0.788230948858356 CL:0000182
SRR807971_Liver 0.788230948858356 CL:0000182
SRR807995_Liver 0.788230948858356 CL:0000182
SRR815140_Liver 0.788230948858356 CL:0000182
SRR815711_Liver 0.773573493946729 CL:0000182
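The output file can be summarised further on the command line. The following sketch computes the number of samples and the mean score per CL term. It assumes whitespace-separated columns with the header line shown above; the scores and the sample SampleA in the stand-in file are invented so that the arithmetic is easy to follow.

```shell
# Stand-in output file with invented scores (same layout as shown above).
tmp=$(mktemp -d)
cat > "$tmp/Ontology_Score_Output.txt" <<'EOF'
Sample Score TermID
SRR659649_Liver 0.75 CL:0000182
SRR807971_Liver 0.25 CL:0000182
SampleA 0.5 CL:0000000
EOF

# Skip the header, accumulate sum and count per term (column 3),
# then print term, sample count and mean score, sorted by term ID.
summary=$(awk '
  NR > 1 { sum[$3] += $2; n[$3]++ }
  END { for (t in sum) printf "%s\t%d\t%.3f\n", t, n[t], sum[t] / n[t] }
' "$tmp/Ontology_Score_Output.txt" | sort)

echo "$summary"
```

For the real output, replace the stand-in path with Ontology_Score_Output.txt or the name passed via --Output.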
If the pngFile parameter is provided, a boxplot visualizing the scores across the CL terms is generated:
Rscript computeOntologyScore.R --pngFile="Example.png"
We provide all generated result files and the R code to recreate the main and supplementary figures included in the manuscript.
To generate the figures, the R packages ggplot2, ggpubr, and gridExtra need to be installed.
Enter the folder Figures:
cd Figures
Here, we provide R scripts that generate each main figure and its corresponding supplementary figure(s):
- generateFigure2.R
- generateFigure3.R
- generateFigure4.R
- generateFigure5.R
The scripts load the necessary data automatically. They need to be run from this directory, for example:
Rscript generateFigure2.R
Figures are stored in both svg and pdf format.