Javier F. Tabima February 12, 2020
- Objectives
- Data sets
- Procedure
- Downloading Orthofinder
- Running OrthoFinder
- OrthoFinder results
Assessments of homology using genomic data allow for the identification of genes under selection, expansion or reduction of genic families, analysis of presence or absence of important genes, and identification of core genes across different taxonomic levels. These assays based on orthologous gene searches come in different flavors, and many tools have been created to identify homologous genes across different genomes (e.g. OrthoMCL, InParanoid, FastOrtho, etc).
In this tutorial we will focus on OrthoFinder (Emms et al. 2015; https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0721-2), a fast and simple software focused on phylogenetic orthology inference. We will focus on the basics of this program, how to run it in the cluster, and how to interpret results to identify core homologous genes, genic duplication and how OrthoFinder reconstructs species trees based on orthologous gene data.
OrthoFinder uses protein sequences in FASTA format and it performs optimally on protein sequences predicted from whole genome sequences. For this tutorial we want to explore orthology across the Eukaryotic tree of life. We will use six different species across the tree of life:
- Canis lupus (Dog)
- Amanita muscaria (Fly agaric)
- Phytophthora infestans (Late blight of potato and tomato)
- Arabidopsis thaliana (thale cress)
- Dyctiostelium discoideum (Cellular slime mold)
- Paramecium tetraurelia
- Download all of the samples to a folder in the cluster. The folder
can be any folder that you have access to. In my case, I’ll download
everything in my
home/
folder
cd ~
mkdir orthofinder_test
cd orthofinder_test
- To download these files into the structure you can either download them all separately or one by one. Here are the commands I use:
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1/GCF_000002285.3_CanFam3.1_protein.faa.gz > Canis_lupus.fasta.gz' -r curl_1
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/827/485/GCA_000827485.1_Amanita_muscaria_Koide_BX008_v1.0/GCA_000827485.1_Amanita_muscaria_Koide_BX008_v1.0_protein.faa.gz > Amanita_muscaria.fasta.gz' -r curl_2
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/142/945/GCF_000142945.1_ASM14294v1/GCF_000142945.1_ASM14294v1_protein.faa.gz > Phytophthora_infestans.fasta.gz' -r curl_3
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_protein.faa.gz > Arabidopsis_thaliana.fasta.gz' -r curl_4
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/695/GCF_000004695.1_dicty_2.7/GCF_000004695.1_dicty_2.7_protein.faa.gz > Dyctiostelium_discoideum.fasta.gz' -r curl_5
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/165/425/GCF_000165425.1_ASM16542v1/GCF_000165425.1_ASM16542v1_protein.faa.gz > Paramecium_tetraurelia.fasta.gz' -r curl_6
- Decompress the fasta files
SGE_Batch -c 'bash; for i in *.gz; do gunzip $i; done' -r gunzip
- Remove all the unnecessary files
rm -rf curl_*/
rm -rf gunzip/
To estimate the number of proteins per downloaded genome we can count
the number of sequences in each of the FASTA files we downloaded. To do
so you can count the number of >
characters per file. This is how I
would do it:
for i in *.fasta; do a=$(grep -c ">" $i); printf $i"\t"$a"\n"; done
The results should look like this:
FASTA file | Number of proteins |
---|---|
Amanita_muscaria.fasta | 18,093 |
Arabidopsis_thaliana.fasta | 48,265 |
Canis_lupus.fasta | 58,774 |
Dyctiostelium_discoideum.fasta | 13,315 |
Paramecium_tetraurelia.fasta | 39,580 |
Phytophthora_infestans.fasta | 17,797 |
- Download the latest OrthoFinder.tar.gz release from github:
cd ~
wget https://github.com/davidemms/OrthoFinder/releases/latest/download/OrthoFinder.tar.gz
- Extract the files:
tar xzf OrthoFinder.tar.gz
- Test OrthoFinder:
cd OrthoFinder/
./orthofinder -h
If OrthoFinder works correctly, OrthoFinder should print its ‘help’ text.
For this example, we will run OrthoFinder using the default settings.
One of the greatest advantages of OrthoFinder is that the only input it
requires is the folder where the FASTA sequences are. In our case, the
FASTA aminoacid files are contained in the ~/orthofinder_test
folder.
Make sure you have the absolute path of your folder. In my example, the
path is /raid1/home/bpp/tabimaj/orthofinder_test
So, to run OrthoFinder while in the ~/OrthoFinder/
folder, you can run
the analysis using the following command by requesting 6
processors/slots from the CGRB cluster:
SGE_Batch -c './orthofinder -f /raid1/home/bpp/tabimaj/orthofinder_test -t 8' -P 8 -r BUG_orthofinder
The run of OrthoFinder took 2h 20mins using eight threads at the CGRB
infrastructure. The host which was OrthoFinder ran at was symbiosis
.
The results of your personal OrthoFinder run should be found in the
~/orthofinder_test/Results_Feb12/
folder. For an example run, I have a
folder that will CGRB users read the files from an identical OrthoFinder
folder at
/raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1
While I recommend you to go to the OrthoFinder webpage to understand what files and results come as output from the program, we will focus in four main results today:
- The Statistics Overall file
- The Statistics Per Species file
- The count of genes per Orthogroup predicted by OrthoFinder
- The Single copy orthologs and the Species Tree
Location:
/raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Comparative_Genomics_Statistics/Statistics_Overall.tsv
The Statistics Overall file is a tab-separated text file that contains general statistics about orthogroup sizes and proportion of genes assigned to orthogroups.
As mentioned by the OrthoFinder webpage:
Most of the terms in the files ‘Statistics_Overall.csv’ and ‘Statistics_PerSpecies.csv’ are self-explanatory, the remainder are defined below.
Species-specific orthogroup: An orthogroups that consist entirely of genes from one species.
G50: The number of genes in the orthogroup such that 50% of genes are in orthogroups of that size or larger.
O50: The smallest number of orthogroups such that 50% of genes are in orthogroups of that size or larger.
Single-copy orthogroup: An orthogroup with exactly one gene (and no more) from each species. These orthogroups are ideal for inferring a species tree and many other analyses.
Unassigned gene: A gene that has not been put into an orthogroup with any other genes.
Lets take a look at the file. Even if this is a “tab-separated” file, we need to split it on sections:
library(magrittr)
library(knitr)
library(kableExtra)
knitr::kable(read.table("Results/Statistics_Overall.tsv",fill = T, nrows = 24, sep = "\t")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
V1 |
V2 |
---|---|
Number of species |
6 |
Number of genes |
195824 |
Number of genes in orthogroups |
172946 |
Number of unassigned genes |
22878 |
Percentage of genes in orthogroups |
88.3 |
Percentage of unassigned genes |
11.7 |
Number of orthogroups |
25731 |
Number of species-specific orthogroups |
19632 |
Number of genes in species-specific orthogroups |
111223 |
Percentage of genes in species-specific orthogroups |
56.8 |
Mean orthogroup size |
6.7 |
Median orthogroup size |
4.0 |
G50 (assigned genes) |
10 |
G50 (all genes) |
9 |
O50 (assigned genes) |
4442 |
O50 (all genes) |
5664 |
Number of orthogroups with all species present |
1319 |
Number of single-copy orthogroups |
32 |
Date |
2020-02-06 |
Orthogroups file |
Orthogroups.tsv |
Unassigned genes file |
Orthogroups_UnassignedGenes.tsv |
Per-species statistics |
Statistics_PerSpecies.tsv |
Overall statistics |
Statistics_Overall.tsv |
Orthogroups shared between species |
Orthogroups_SpeciesOverlaps.tsv |
-
What is the number of orthogroups that consist entirely of genes from one species? Is it the majority of data? What percentage of the orthogroups is is?
-
What is the number of single-copy orthogroups found in the analysis? What does this mean biologically?
genes_per_OG <- read.delim("Results/Statistics_Overall.tsv",fill = T, skip = 25, nrows = 20, sep = "\t")
kable(genes_per_OG) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Average.number.of.genes.per.species.in.orthogroup |
Number.of.orthogroups |
Percentage.of.orthogroups |
Number.of.genes |
Percentage.of.genes |
---|---|---|---|---|
<1 |
16108 |
62.6 |
47651 |
27.6 |
’1 |
6015 |
23.4 |
47718 |
27.6 |
’2 |
1887 |
7.3 |
26446 |
15.3 |
’3 |
737 |
2.9 |
14887 |
8.6 |
’4 |
399 |
1.6 |
10460 |
6.0 |
’5 |
236 |
0.9 |
7651 |
4.4 |
’6 |
130 |
0.5 |
4987 |
2.9 |
’7 |
70 |
0.3 |
3103 |
1.8 |
’8 |
45 |
0.2 |
2268 |
1.3 |
’9 |
28 |
0.1 |
1569 |
0.9 |
’10 |
21 |
0.1 |
1297 |
0.7 |
11-15 |
44 |
0.2 |
3353 |
1.9 |
16-20 |
8 |
0.0 |
880 |
0.5 |
21-50 |
3 |
0.0 |
676 |
0.4 |
51-100 |
0 |
0.0 |
0 |
0.0 |
101-150 |
0 |
0.0 |
0 |
0.0 |
151-200 |
0 |
0.0 |
0 |
0.0 |
201-500 |
0 |
0.0 |
0 |
0.0 |
501-1000 |
0 |
0.0 |
0 |
0.0 |
’1001+ |
0 |
0.0 |
0 |
0.0 |
I think its easier to see these results on a plot than in the table
library(reshape2)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag() masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
genes_per_OG.m <- melt(genes_per_OG, id.vars = "Average.number.of.genes.per.species.in.orthogroup", factorsAsStrings = T)
genes_per_OG.m$Average.number.of.genes.per.species.in.orthogroup <- factor(genes_per_OG.m$Average.number.of.genes.per.species.in.orthogroup,levels = as.character(genes_per_OG$Average.number.of.genes.per.species.in.orthogroup))
ggplot(genes_per_OG.m, aes(x=Average.number.of.genes.per.species.in.orthogroup , y=value, fill=variable)) + geom_bar(stat="identity") + facet_grid(variable~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none")
- What is the overall pattern of the results we find?
species_per_OG <- read.delim("Results/Statistics_Overall.tsv",fill = T, skip = 46, nrows = 7, sep = "\t", header = T)
kable(species_per_OG) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Number.of.species.in.orthogroup |
Number.of.orthogroups |
---|---|
1 |
19632 |
2 |
2027 |
3 |
1128 |
4 |
774 |
5 |
851 |
6 |
1319 |
ggplot(species_per_OG, aes(x=as.factor(Number.of.species.in.orthogroup), y=Number.of.orthogroups, fill=as.factor(Number.of.species.in.orthogroup))) + geom_bar(stat="identity") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none") + xlab("Number of species on orthogroup") + ylab("Number of orthogroups")
-
The highest number of orthogroups contain only one species. What are the biological implications of this result?
-
If OrthoFinder was used in a subset of highly-similar species, would this result be similar or would it change? How so?
Location:
/raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Comparative_Genomics_Statistics/Statistics_PerSpecies.tsv
The Statistics per Species file is a tab-separated text file that contains general statistics about the number of genes that are within each category of orthogroups for each species. As with the Statistics Overall table, we have to divide it into different sections:
overall_stats <- read.table("Results/Statistics_PerSpecies.tsv",fill = T, nrows = 10, sep = "\t", header = T)
species_names <- colnames(overall_stats)[-1]
knitr::kable(overall_stats) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
X |
Amanita_muscaria |
Arabidopsis_thaliana |
Canis_lupus |
Dyctiostelium_discoideum |
Paramecium_tetraurelia |
Phytophthora_infestans |
---|---|---|---|---|---|---|
Number of genes |
18093.0 |
48265.0 |
58774.0 |
13315.0 |
39580.0 |
17797.0 |
Number of genes in orthogroups |
12895.0 |
44825.0 |
56239.0 |
9889.0 |
35093.0 |
14005.0 |
Number of unassigned genes |
5198.0 |
3440.0 |
2535.0 |
3426.0 |
4487.0 |
3792.0 |
Percentage of genes in orthogroups |
71.3 |
92.9 |
95.7 |
74.3 |
88.7 |
78.7 |
Percentage of unassigned genes |
28.7 |
7.1 |
4.3 |
25.7 |
11.3 |
21.3 |
Number of orthogroups containing species |
4719.0 |
8553.0 |
9321.0 |
4723.0 |
9480.0 |
5539.0 |
Percentage of orthogroups containing species |
18.3 |
33.2 |
36.2 |
18.4 |
36.8 |
21.5 |
Number of species-specific orthogroups |
1310.0 |
4665.0 |
4930.0 |
769.0 |
6636.0 |
1322.0 |
Number of genes in species-specific orthogroups |
7616.0 |
28905.0 |
36927.0 |
4152.0 |
26548.0 |
7075.0 |
Percentage of genes in species-specific orthogroups |
42.1 |
59.9 |
62.8 |
31.2 |
67.1 |
39.8 |
As these results are very numerous, lets graph them. We will do a subset of the non-percentage data first:
overall_stats.number <- overall_stats[grep(x = overall_stats$X, pattern = "Number"),]
overall_stats.number <- melt(overall_stats.number)
## Using X as id variables
overall_stats.number$X <- gsub(overall_stats.number$X, pattern = "Number of ", replacement = "") %>% tools::toTitleCase()
ggplot(overall_stats.number, aes(x=variable , y=value, fill=X)) + geom_bar(stat="identity") + facet_grid(X~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none") + ylab("Number")
-
What is the species with the most and the least number of genes used in the analysis?
-
What does the Species specific orthologs category mean? Is this representing paralogy?
-
Does this table/plot accurately represent the contribution of each species to the orthology assay?
While the numeric results provide us with an overall assay of the general results of the orthologous assay, the data formatted in percentages may prove a more standardized vision of the results:
overall_stats.percentage <- overall_stats[grep(x = overall_stats$X, pattern = "Number", invert = T),]
overall_stats.percentage <- melt(overall_stats.percentage)
## Using X as id variables
overall_stats.percentage$X <- gsub(overall_stats.percentage$X, pattern = "Percentage of ", replacement = "") %>% tools::toTitleCase()
ggplot(overall_stats.percentage, aes(x=variable , y=value, fill=X)) + geom_bar(stat="identity") + facet_grid(X~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none") + ylab("Percentage (%)")
- What is the species with the most shared genes across the entire data set?
The second section of the Statistics per Species file includes all the statistics of the number and percentage of orthologs that contain data for each species. For today we will focus only on the percentages (Lines 35 - 56):
perecentage_stats <- read.delim("Results/Statistics_PerSpecies.tsv",fill = T, skip = 35, nrows = 20, sep = "\t", header = T)
colnames(perecentage_stats)[-1] <- species_names
knitr::kable(perecentage_stats) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Number.of.genes.per.species.in.orthogroup |
Amanita_muscaria |
Arabidopsis_thaliana |
Canis_lupus |
Dyctiostelium_discoideum |
Paramecium_tetraurelia |
Phytophthora_infestans |
---|---|---|---|---|---|---|
’0 |
81.7 |
66.8 |
63.8 |
81.6 |
63.2 |
78.5 |
’1 |
10.7 |
4.1 |
5.1 |
12.5 |
3.1 |
12.1 |
’2 |
3.6 |
8.9 |
6.8 |
3.1 |
16.0 |
4.7 |
’3 |
1.4 |
5.7 |
5.0 |
1.1 |
6.1 |
1.7 |
’4 |
0.7 |
3.7 |
3.7 |
0.5 |
4.4 |
0.9 |
’5 |
0.4 |
2.4 |
2.9 |
0.3 |
2.0 |
0.5 |
’6 |
0.3 |
1.8 |
2.1 |
0.2 |
1.5 |
0.3 |
’7 |
0.2 |
1.3 |
1.8 |
0.2 |
0.9 |
0.3 |
’8 |
0.1 |
0.9 |
1.4 |
0.1 |
0.6 |
0.2 |
’9 |
0.1 |
0.7 |
1.1 |
0.1 |
0.5 |
0.2 |
’10 |
0.1 |
0.5 |
0.9 |
0.1 |
0.2 |
0.1 |
11-15 |
0.3 |
1.5 |
2.5 |
0.2 |
0.7 |
0.3 |
16-20 |
0.1 |
0.7 |
1.1 |
0.1 |
0.3 |
0.2 |
21-50 |
0.2 |
1.1 |
1.5 |
0.1 |
0.4 |
0.2 |
51-100 |
0.0 |
0.1 |
0.1 |
0.0 |
0.1 |
0.0 |
101-150 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
151-200 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
201-500 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
501-1000 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
’1001+ |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
perecentage_stats$Number.of.genes.per.species.in.orthogroup <- factor(perecentage_stats$Number.of.genes.per.species.in.orthogroup, levels = perecentage_stats$Number.of.genes.per.species.in.orthogroup)
perecentage_stats <- melt(perecentage_stats)
## Using Number.of.genes.per.species.in.orthogroup as id variables
ggplot(perecentage_stats, aes(x=Number.of.genes.per.species.in.orthogroup , y=value, fill=variable)) + geom_bar(stat="identity", position = "dodge") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "bottom") + ylab("Percentage (%)")
- Why are these results expected?
Location:
/raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Orthogroups/Orthogroups.GeneCount.tsv
The Statistics per Species and Statistics Overall files provided us with a general view of the contribution of each genome and each orthogroup into our orthology analysis. For a more detailed look into our results, we should look at the Orthogroups GeneCount file This file includes the number of genes per species that are grouped inside each predicted orthogroup (OG).
Lets take a look at the first 5 rows of the file:
og.gene_count <- read.table("Results/Orthogroups.GeneCount.tsv", header = T)
head(og.gene_count) %>% knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Orthogroup |
Amanita_muscaria |
Arabidopsis_thaliana |
Canis_lupus |
Dyctiostelium_discoideum |
Paramecium_tetraurelia |
Phytophthora_infestans |
Total |
---|---|---|---|---|---|---|---|
OG0000000 |
238 |
0 |
0 |
0 |
0 |
0 |
238 |
OG0000001 |
1 |
234 |
0 |
0 |
0 |
1 |
236 |
OG0000002 |
200 |
0 |
0 |
0 |
0 |
2 |
202 |
OG0000003 |
2 |
24 |
0 |
15 |
0 |
84 |
125 |
OG0000004 |
2 |
46 |
35 |
10 |
5 |
20 |
118 |
OG0000005 |
0 |
117 |
0 |
0 |
0 |
0 |
117 |
The results of the Orthogroups GeneCount file are sorted by number of total genes per orthogroup. These results show that the contributions of each species into each orthogroup are quite different. For OG0000000 we see all genes (238 genes) are contributed by Amanita muscaria. This is a Single species orthogroup.
Conversely, OG0000003, OG0000004 and OG0000005 are comprised by many genes of different species. So these are Shared species orthogroups.
We can observe the general pattern of the orthogroups by plotting a heat
map for the first 200 orthogroups without the Total
column
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.9.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
OG.mtrix <- as.matrix(og.gene_count[c(1:200),-c(1,8)])
rownames(OG.mtrix) <- og.gene_count[c(1:200),1]
OG.mtrix[OG.mtrix == 0] <- NA
heatmap.bp(OG.mtrix)
Several things can be concluded from that figure. First, that the distribution of species that contribute in each OG is patchy. Second, that for the first 200 OG, the species that contribute the most genes are A. thaliana and C. lupus. Third, that there are some OG that have genes in every species. When we review the results of the Statistics Overall table, we remember that 1319 OG have all species present. We should be able to see the contributions of each species per OG by removing all OG with values of 0 and counting the remaining OG:
non_zero_OG <- og.gene_count[apply(og.gene_count[,-1], 1, function (x) sum(x == 0) == 0),]
nrow(non_zero_OG)
## [1] 1319
Woohoo! Lets see the first 10 rows of the OG with all species:
head(non_zero_OG, n=10) %>% knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Orthogroup |
Amanita_muscaria |
Arabidopsis_thaliana |
Canis_lupus |
Dyctiostelium_discoideum |
Paramecium_tetraurelia |
Phytophthora_infestans |
Total |
|
---|---|---|---|---|---|---|---|---|
5 |
OG0000004 |
2 |
46 |
35 |
10 |
5 |
20 |
118 |
11 |
OG0000010 |
6 |
7 |
35 |
8 |
29 |
15 |
100 |
18 |
OG0000017 |
8 |
9 |
10 |
10 |
32 |
14 |
83 |
24 |
OG0000023 |
10 |
18 |
19 |
13 |
7 |
13 |
80 |
25 |
OG0000024 |
24 |
18 |
20 |
14 |
2 |
1 |
79 |
27 |
OG0000026 |
8 |
19 |
24 |
6 |
16 |
5 |
78 |
35 |
OG0000034 |
3 |
12 |
20 |
3 |
21 |
15 |
74 |
42 |
OG0000041 |
5 |
20 |
31 |
2 |
11 |
3 |
72 |
44 |
OG0000043 |
4 |
7 |
34 |
7 |
11 |
7 |
70 |
45 |
OG0000044 |
2 |
2 |
28 |
5 |
19 |
14 |
70 |
What about the last 10 rows?
tail(non_zero_OG, n=10) %>% knitr::kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Orthogroup |
Amanita_muscaria |
Arabidopsis_thaliana |
Canis_lupus |
Dyctiostelium_discoideum |
Paramecium_tetraurelia |
Phytophthora_infestans |
Total |
|
---|---|---|---|---|---|---|---|---|
8294 |
OG0008293 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8304 |
OG0008303 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8316 |
OG0008315 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8328 |
OG0008327 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8330 |
OG0008329 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8334 |
OG0008333 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8338 |
OG0008337 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8342 |
OG0008341 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8344 |
OG0008343 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
8355 |
OG0008354 |
1 |
1 |
1 |
1 |
1 |
1 |
6 |
-
What is the biological importance of these OG that have genes in every species?
-
What are the last 10 rows showing us? Why are these OG important?
Finally, we will focus in two very important results of OrthoFinder: The Single copy orthologs and the Species Tree.
The Single copy orthologs are homologous and orthologous genes that
exist in every single species and only have one copy. These genes are of
fundamental importance for molecular evolution, as are genes that can be
used to study the impacts of selection or drift across our data set. In
our case, the tree of life. We can identify those genes by summarizing
the Orthogroups GeneCount file, or by summarizing the number of
.fasta
files in the Single_Copy_Orthologue_Sequences folder in
our OrthoGroups folder.
We have a total of 32 single copy ortholog groups across our data set. If we want to know the function, we can use a very well annotated genome and search for the function.
For example, OG0008354.fa
has the A. thaliana gene NP_193902.1
. We
can go to NBCI and check the annotation of this gene:
As we can see, the NP_193902.1
gene has the annotation associated with
DNA-directed RNA polymerase family protein
, so we can infer that this
function is conserved across the tree of life.
In addition, we can see if this protein sequence has evolved differently
across the tree of life. To do so, we can do a multiple sequence
alignment of the OG0008354.fa
gene using
MAFFT
The MSA statistics show that this protein family has 366 (28.5%) identical sites, with a pairwise Identity: of 56.0%. So its quite the conserved gene across the eukaryotic tree of life.
-
What are the other functions of the other single copy OG?
-
Are the other single copy OG more or less conserved phylogenetically than OG0008354?
The Species Tree is the phylogenetic reconstruction of the species used in this orthology analysis. How is it calculated? According to Emms and Kelly, 2019:
OrthoFinder infers orthologs from the orthogroup trees (a gene tree for the orthogroup) using the steps shown in Fig. 2. Input proteomes are provided by the user using one FASTA file per species. Each file contains the amino acid sequences for the proteins in that species. Orthogroups are inferred using the original OrthoFinder algorithm [10]; an unrooted gene tree is inferred for each orthogroup using DendroBLAST [24]; the unrooted species tree is inferred from this set of unrooted orthogroup trees using the STAG algorithm [33]; this STAG species tree is then rooted using the STRIDE algorithm by identifying high-confidence gene duplication events in the complete set of unrooted orthogroup trees [22]; the rooted species tree is used to root the orthogroup trees; orthologs and gene duplication events are inferred from the rooted orthogroup trees by a novel hybrid algorithm that combines the “species-overlap” method [31] and the duplication-loss-coalescent model [32] (described below); and comparative statistics are calculated.
So, how does the species tree look? We can find it at
~/orthofinder_test/OrthoFinder/Results_Feb06_1/Species_Tree/SpeciesTree_rooted.txt
.
We can open it on R:
library(ggtree)
## Registered S3 method overwritten by 'treeio':
## method from
## root.phylo ape
## ggtree v1.16.6 For help: https://yulab-smu.github.io/treedata-book/
##
## If you use ggtree in published research, please cite the most appropriate paper(s):
##
## �[36m-�[39m Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi: 10.1093/molbev/msy194
## �[36m-�[39m Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:tidyr':
##
## expand
## The following object is masked from 'package:magrittr':
##
## inset
species.tree <- read.tree("Results/SpeciesTree_rooted.txt.tre")
ggtree(species.tree, layout = "circular") + geom_tiplab2(size=3) + xlim(0, 1)
- How does the tree look? Does it make sense biologically?