OrthoFinder tutorial for BUG

Javier F. Tabima February 12, 2020

Objectives
Data sets
Procedure
- Obtaining data
- Summary of data
Downloading Orthofinder
Running OrthoFinder
OrthoFinder results

Objectives

Assessments of homology using genomic data allow for the identification of genes under selection, expansion or reduction of genic families, analysis of presence or absence of important genes, and identification of core genes across different taxonomic levels. These assays based on orthologous gene searches come in different flavors, and many tools have been created to identify homologous genes across different genomes (e.g. OrthoMCL, InParanoid, FastOrtho, etc).

In this tutorial we will focus on OrthoFinder (Emms et al. 2015; https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0721-2), a fast and simple software focused on phylogenetic orthology inference. We will focus on the basics of this program, how to run it in the cluster, and how to interpret results to identify core homologous genes, genic duplication and how OrthoFinder reconstructs species trees based on orthologous gene data.

Data sets

OrthoFinder uses protein sequences in FASTA format and it performs optimally on protein sequences predicted from whole genome sequences. For this tutorial we want to explore orthology across the Eukaryotic tree of life. We will use six different species across the tree of life:

Canis lupus (Dog)
Amanita muscaria (Fly agaric)
Phytophthora infestans (Late blight of potato and tomato)
Arabidopsis thaliana (thale cress)
Dyctiostelium discoideum (Cellular slime mold)
Paramecium tetraurelia

Procedure

Obtaining data

Download all of the samples to a folder in the cluster. The folder can be any folder that you have access to. In my case, I’ll download everything in my home/ folder

cd ~
mkdir orthofinder_test
cd orthofinder_test

To download these files into the structure you can either download them all separately or one by one. Here are the commands I use:

SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1/GCF_000002285.3_CanFam3.1_protein.faa.gz > Canis_lupus.fasta.gz' -r curl_1
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/827/485/GCA_000827485.1_Amanita_muscaria_Koide_BX008_v1.0/GCA_000827485.1_Amanita_muscaria_Koide_BX008_v1.0_protein.faa.gz > Amanita_muscaria.fasta.gz' -r curl_2
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/142/945/GCF_000142945.1_ASM14294v1/GCF_000142945.1_ASM14294v1_protein.faa.gz  > Phytophthora_infestans.fasta.gz' -r curl_3
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_protein.faa.gz > Arabidopsis_thaliana.fasta.gz' -r curl_4
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/695/GCF_000004695.1_dicty_2.7/GCF_000004695.1_dicty_2.7_protein.faa.gz > Dyctiostelium_discoideum.fasta.gz' -r curl_5
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/165/425/GCF_000165425.1_ASM16542v1/GCF_000165425.1_ASM16542v1_protein.faa.gz > Paramecium_tetraurelia.fasta.gz' -r curl_6

Decompress the fasta files

SGE_Batch -c 'bash; for i in *.gz; do gunzip $i; done' -r gunzip

Remove all the unnecessary files

rm -rf curl_*/
rm -rf gunzip/

Summary of data

To estimate the number of proteins per downloaded genome we can count the number of sequences in each of the FASTA files we downloaded. To do so you can count the number of > characters per file. This is how I would do it:

for i in *.fasta; do a=$(grep -c ">" $i); printf $i"\t"$a"\n"; done

The results should look like this:

FASTA file	Number of proteins
Amanita_muscaria.fasta	18,093
Arabidopsis_thaliana.fasta	48,265
Canis_lupus.fasta	58,774
Dyctiostelium_discoideum.fasta	13,315
Paramecium_tetraurelia.fasta	39,580
Phytophthora_infestans.fasta	17,797

Downloading Orthofinder

Download the latest OrthoFinder.tar.gz release from github:

cd ~
wget https://github.com/davidemms/OrthoFinder/releases/latest/download/OrthoFinder.tar.gz

Extract the files:

tar xzf OrthoFinder.tar.gz

Test OrthoFinder:

cd OrthoFinder/
./orthofinder -h

If OrthoFinder works correctly, OrthoFinder should print its ‘help’ text.

Running OrthoFinder

For this example, we will run OrthoFinder using the default settings. One of the greatest advantages of OrthoFinder is that the only input it requires is the folder where the FASTA sequences are. In our case, the FASTA aminoacid files are contained in the ~/orthofinder_test folder. Make sure you have the absolute path of your folder. In my example, the path is /raid1/home/bpp/tabimaj/orthofinder_test

So, to run OrthoFinder while in the ~/OrthoFinder/ folder, you can run the analysis using the following command by requesting 6 processors/slots from the CGRB cluster:

SGE_Batch -c './orthofinder -f /raid1/home/bpp/tabimaj/orthofinder_test -t 8' -P 8 -r BUG_orthofinder

OrthoFinder results

The run of OrthoFinder took 2h 20mins using eight threads at the CGRB infrastructure. The host which was OrthoFinder ran at was symbiosis.

The results of your personal OrthoFinder run should be found in the ~/orthofinder_test/Results_Feb12/ folder. For an example run, I have a folder that will CGRB users read the files from an identical OrthoFinder folder at /raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1

While I recommend you to go to the OrthoFinder webpage to understand what files and results come as output from the program, we will focus in four main results today:

The Statistics Overall file
The Statistics Per Species file
The count of genes per Orthogroup predicted by OrthoFinder
The Single copy orthologs and the Species Tree

Statistics Overall

Location: /raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Comparative_Genomics_Statistics/Statistics_Overall.tsv

The Statistics Overall file is a tab-separated text file that contains general statistics about orthogroup sizes and proportion of genes assigned to orthogroups.

As mentioned by the OrthoFinder webpage:

Most of the terms in the files ‘Statistics_Overall.csv’ and ‘Statistics_PerSpecies.csv’ are self-explanatory, the remainder are defined below.

Species-specific orthogroup: An orthogroups that consist entirely of genes from one species.

G50: The number of genes in the orthogroup such that 50% of genes are in orthogroups of that size or larger.

O50: The smallest number of orthogroups such that 50% of genes are in orthogroups of that size or larger.

Single-copy orthogroup: An orthogroup with exactly one gene (and no more) from each species. These orthogroups are ideal for inferring a species tree and many other analyses.

Unassigned gene: A gene that has not been put into an orthogroup with any other genes.

Lets take a look at the file. Even if this is a “tab-separated” file, we need to split it on sections:

Section 1: Overall statistics (Lines 1 to 24)

library(magrittr)
library(knitr)
library(kableExtra)

knitr::kable(read.table("Results/Statistics_Overall.tsv",fill = T, nrows = 24, sep = "\t")) %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

V1	V2
Number of species	6
Number of genes	195824
Number of genes in orthogroups	172946
Number of unassigned genes	22878
Percentage of genes in orthogroups	88.3
Percentage of unassigned genes	11.7
Number of orthogroups	25731
Number of species-specific orthogroups	19632
Number of genes in species-specific orthogroups	111223
Percentage of genes in species-specific orthogroups	56.8
Mean orthogroup size	6.7
Median orthogroup size	4.0
G50 (assigned genes)	10
G50 (all genes)	9
O50 (assigned genes)	4442
O50 (all genes)	5664
Number of orthogroups with all species present	1319
Number of single-copy orthogroups	32
Date	2020-02-06
Orthogroups file	Orthogroups.tsv
Unassigned genes file	Orthogroups_UnassignedGenes.tsv
Per-species statistics	Statistics_PerSpecies.tsv
Overall statistics	Statistics_Overall.tsv
Orthogroups shared between species	Orthogroups_SpeciesOverlaps.tsv

Questions (Section 1):

What is the number of orthogroups that consist entirely of genes from one species? Is it the majority of data? What percentage of the orthogroups is is?
What is the number of single-copy orthogroups found in the analysis? What does this mean biologically?

Section 2: Genes per Orthogroup (Lines 26-45)

genes_per_OG <- read.delim("Results/Statistics_Overall.tsv",fill = T, skip = 25, nrows = 20, sep = "\t")
kable(genes_per_OG) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Average.number.of.genes.per.species.in.orthogroup	Number.of.orthogroups	Percentage.of.orthogroups	Number.of.genes	Percentage.of.genes
<1	16108	62.6	47651	27.6
’1	6015	23.4	47718	27.6
’2	1887	7.3	26446	15.3
’3	737	2.9	14887	8.6
’4	399	1.6	10460	6.0
’5	236	0.9	7651	4.4
’6	130	0.5	4987	2.9
’7	70	0.3	3103	1.8
’8	45	0.2	2268	1.3
’9	28	0.1	1569	0.9
’10	21	0.1	1297	0.7
11-15	44	0.2	3353	1.9
16-20	8	0.0	880	0.5
21-50	3	0.0	676	0.4
51-100	0	0.0	0	0.0
101-150	0	0.0	0	0.0
151-200	0	0.0	0	0.0
201-500	0	0.0	0	0.0
501-1000	0	0.0	0	0.0
’1001+	0	0.0	0	0.0

I think its easier to see these results on a plot than in the table

library(reshape2)
library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::extract()    masks magrittr::extract()
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
## x purrr::set_names()  masks magrittr::set_names()

genes_per_OG.m <- melt(genes_per_OG, id.vars = "Average.number.of.genes.per.species.in.orthogroup", factorsAsStrings = T)
genes_per_OG.m$Average.number.of.genes.per.species.in.orthogroup <- factor(genes_per_OG.m$Average.number.of.genes.per.species.in.orthogroup,levels = as.character(genes_per_OG$Average.number.of.genes.per.species.in.orthogroup))

ggplot(genes_per_OG.m, aes(x=Average.number.of.genes.per.species.in.orthogroup , y=value, fill=variable)) + geom_bar(stat="identity") + facet_grid(variable~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none")

Questions (Section 2):

What is the overall pattern of the results we find?

Section 3: Species per Orthogroup (Lines 47-54)

species_per_OG <- read.delim("Results/Statistics_Overall.tsv",fill = T, skip = 46, nrows = 7, sep = "\t", header = T)
kable(species_per_OG) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Number.of.species.in.orthogroup	Number.of.orthogroups
1	19632
2	2027
3	1128
4	774
5	851
6	1319

ggplot(species_per_OG, aes(x=as.factor(Number.of.species.in.orthogroup), y=Number.of.orthogroups, fill=as.factor(Number.of.species.in.orthogroup))) + geom_bar(stat="identity") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none")  + xlab("Number of species on orthogroup") + ylab("Number of orthogroups")

Questions (Section 3):

The highest number of orthogroups contain only one species. What are the biological implications of this result?
If OrthoFinder was used in a subset of highly-similar species, would this result be similar or would it change? How so?

Statistics per Species

Location: /raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Comparative_Genomics_Statistics/Statistics_PerSpecies.tsv

The Statistics per Species file is a tab-separated text file that contains general statistics about the number of genes that are within each category of orthogroups for each species. As with the Statistics Overall table, we have to divide it into different sections:

Section 1: Overall statistics (Lines 1 to 24)

overall_stats <- read.table("Results/Statistics_PerSpecies.tsv",fill = T, nrows = 10, sep = "\t", header = T)
species_names <- colnames(overall_stats)[-1]

knitr::kable(overall_stats) %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

X	Amanita_muscaria	Arabidopsis_thaliana	Canis_lupus	Dyctiostelium_discoideum	Paramecium_tetraurelia	Phytophthora_infestans
Number of genes	18093.0	48265.0	58774.0	13315.0	39580.0	17797.0
Number of genes in orthogroups	12895.0	44825.0	56239.0	9889.0	35093.0	14005.0
Number of unassigned genes	5198.0	3440.0	2535.0	3426.0	4487.0	3792.0
Percentage of genes in orthogroups	71.3	92.9	95.7	74.3	88.7	78.7
Percentage of unassigned genes	28.7	7.1	4.3	25.7	11.3	21.3
Number of orthogroups containing species	4719.0	8553.0	9321.0	4723.0	9480.0	5539.0
Percentage of orthogroups containing species	18.3	33.2	36.2	18.4	36.8	21.5
Number of species-specific orthogroups	1310.0	4665.0	4930.0	769.0	6636.0	1322.0
Number of genes in species-specific orthogroups	7616.0	28905.0	36927.0	4152.0	26548.0	7075.0
Percentage of genes in species-specific orthogroups	42.1	59.9	62.8	31.2	67.1	39.8

As these results are very numerous, lets graph them. We will do a subset of the non-percentage data first:

overall_stats.number <- overall_stats[grep(x = overall_stats$X, pattern = "Number"),]
overall_stats.number <- melt(overall_stats.number)

## Using X as id variables

overall_stats.number$X <- gsub(overall_stats.number$X, pattern = "Number of ", replacement = "") %>% tools::toTitleCase()

ggplot(overall_stats.number, aes(x=variable , y=value, fill=X)) + geom_bar(stat="identity") + facet_grid(X~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none") + ylab("Number")

Questions (Section 1 - Numeric results):

What is the species with the most and the least number of genes used in the analysis?
What does the Species specific orthologs category mean? Is this representing paralogy?
Does this table/plot accurately represent the contribution of each species to the orthology assay?

While the numeric results provide us with an overall assay of the general results of the orthologous assay, the data formatted in percentages may prove a more standardized vision of the results:

overall_stats.percentage <- overall_stats[grep(x = overall_stats$X, pattern = "Number", invert = T),]
overall_stats.percentage <- melt(overall_stats.percentage)

## Using X as id variables

overall_stats.percentage$X <- gsub(overall_stats.percentage$X, pattern = "Percentage of ", replacement = "") %>% tools::toTitleCase()

ggplot(overall_stats.percentage, aes(x=variable , y=value, fill=X)) + geom_bar(stat="identity") + facet_grid(X~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none") + ylab("Percentage (%)")

Questions (Section 1 - Percentage results):

What is the species with the most shared genes across the entire data set?

Section 2. Per species- per ortholog statistics (Lines 14 to 56)

The second section of the Statistics per Species file includes all the statistics of the number and percentage of orthologs that contain data for each species. For today we will focus only on the percentages (Lines 35 - 56):

perecentage_stats <- read.delim("Results/Statistics_PerSpecies.tsv",fill = T, skip = 35, nrows = 20, sep = "\t", header = T)
colnames(perecentage_stats)[-1] <- species_names

knitr::kable(perecentage_stats) %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Number.of.genes.per.species.in.orthogroup	Amanita_muscaria	Arabidopsis_thaliana	Canis_lupus	Dyctiostelium_discoideum	Paramecium_tetraurelia	Phytophthora_infestans
’0	81.7	66.8	63.8	81.6	63.2	78.5
’1	10.7	4.1	5.1	12.5	3.1	12.1
’2	3.6	8.9	6.8	3.1	16.0	4.7
’3	1.4	5.7	5.0	1.1	6.1	1.7
’4	0.7	3.7	3.7	0.5	4.4	0.9
’5	0.4	2.4	2.9	0.3	2.0	0.5
’6	0.3	1.8	2.1	0.2	1.5	0.3
’7	0.2	1.3	1.8	0.2	0.9	0.3
’8	0.1	0.9	1.4	0.1	0.6	0.2
’9	0.1	0.7	1.1	0.1	0.5	0.2
’10	0.1	0.5	0.9	0.1	0.2	0.1
11-15	0.3	1.5	2.5	0.2	0.7	0.3
16-20	0.1	0.7	1.1	0.1	0.3	0.2
21-50	0.2	1.1	1.5	0.1	0.4	0.2
51-100	0.0	0.1	0.1	0.0	0.1	0.0
101-150	0.0	0.0	0.0	0.0	0.0	0.0
151-200	0.0	0.0	0.0	0.0	0.0	0.0
201-500	0.0	0.0	0.0	0.0	0.0	0.0
501-1000	0.0	0.0	0.0	0.0	0.0	0.0
’1001+	0.0	0.0	0.0	0.0	0.0	0.0

perecentage_stats$Number.of.genes.per.species.in.orthogroup <- factor(perecentage_stats$Number.of.genes.per.species.in.orthogroup, levels = perecentage_stats$Number.of.genes.per.species.in.orthogroup)

perecentage_stats <- melt(perecentage_stats)

## Using Number.of.genes.per.species.in.orthogroup as id variables

ggplot(perecentage_stats, aes(x=Number.of.genes.per.species.in.orthogroup , y=value, fill=variable)) + geom_bar(stat="identity", position = "dodge") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "bottom") + ylab("Percentage (%)")

Questions (Section 2 - Percentage results):

Why are these results expected?

The count of genes per Orthogroup predicted by OrthoFinder

Location: /raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Orthogroups/Orthogroups.GeneCount.tsv

The Statistics per Species and Statistics Overall files provided us with a general view of the contribution of each genome and each orthogroup into our orthology analysis. For a more detailed look into our results, we should look at the Orthogroups GeneCount file This file includes the number of genes per species that are grouped inside each predicted orthogroup (OG).

Lets take a look at the first 5 rows of the file:

og.gene_count <- read.table("Results/Orthogroups.GeneCount.tsv", header = T)

head(og.gene_count) %>% knitr::kable() %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Orthogroup	Amanita_muscaria	Arabidopsis_thaliana	Canis_lupus	Dyctiostelium_discoideum	Paramecium_tetraurelia	Phytophthora_infestans	Total
OG0000000	238	0	0	0	0	0	238
OG0000001	1	234	0	0	0	1	236
OG0000002	200	0	0	0	0	2	202
OG0000003	2	24	0	15	0	84	125
OG0000004	2	46	35	10	5	20	118
OG0000005	0	117	0	0	0	0	117

The results of the Orthogroups GeneCount file are sorted by number of total genes per orthogroup. These results show that the contributions of each species into each orthogroup are quite different. For OG0000000 we see all genes (238 genes) are contributed by Amanita muscaria. This is a Single species orthogroup.

Conversely, OG0000003, OG0000004 and OG0000005 are comprised by many genes of different species. So these are Shared species orthogroups.

We can observe the general pattern of the orthogroups by plotting a heat map for the first 200 orthogroups without the Total column

library(vcfR)

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.9.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

OG.mtrix <- as.matrix(og.gene_count[c(1:200),-c(1,8)])
rownames(OG.mtrix) <- og.gene_count[c(1:200),1]
OG.mtrix[OG.mtrix == 0] <- NA
heatmap.bp(OG.mtrix)

Several things can be concluded from that figure. First, that the distribution of species that contribute in each OG is patchy. Second, that for the first 200 OG, the species that contribute the most genes are A. thaliana and C. lupus. Third, that there are some OG that have genes in every species. When we review the results of the Statistics Overall table, we remember that 1319 OG have all species present. We should be able to see the contributions of each species per OG by removing all OG with values of 0 and counting the remaining OG:

non_zero_OG <- og.gene_count[apply(og.gene_count[,-1], 1, function (x) sum(x == 0) == 0),]
nrow(non_zero_OG)

## [1] 1319

Woohoo! Lets see the first 10 rows of the OG with all species:

head(non_zero_OG, n=10) %>% knitr::kable() %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Orthogroup	Amanita_muscaria	Arabidopsis_thaliana	Canis_lupus	Dyctiostelium_discoideum	Paramecium_tetraurelia	Phytophthora_infestans	Total
5	OG0000004	2	46	35	10	5	20	118
11	OG0000010	6	7	35	8	29	15	100
18	OG0000017	8	9	10	10	32	14	83
24	OG0000023	10	18	19	13	7	13	80
25	OG0000024	24	18	20	14	2	1	79
27	OG0000026	8	19	24	6	16	5	78
35	OG0000034	3	12	20	3	21	15	74
42	OG0000041	5	20	31	2	11	3	72
44	OG0000043	4	7	34	7	11	7	70
45	OG0000044	2	2	28	5	19	14	70

What about the last 10 rows?

tail(non_zero_OG, n=10) %>% knitr::kable() %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Orthogroup	Amanita_muscaria	Arabidopsis_thaliana	Canis_lupus	Dyctiostelium_discoideum	Paramecium_tetraurelia	Phytophthora_infestans	Total
8294	OG0008293	1	1	1	1	1	1	6
8304	OG0008303	1	1	1	1	1	1	6
8316	OG0008315	1	1	1	1	1	1	6
8328	OG0008327	1	1	1	1	1	1	6
8330	OG0008329	1	1	1	1	1	1	6
8334	OG0008333	1	1	1	1	1	1	6
8338	OG0008337	1	1	1	1	1	1	6
8342	OG0008341	1	1	1	1	1	1	6
8344	OG0008343	1	1	1	1	1	1	6
8355	OG0008354	1	1	1	1	1	1	6

Questions (Section 3):

What is the biological importance of these OG that have genes in every species?
What are the last 10 rows showing us? Why are these OG important?

The Single copy orthologs and the Species Tree

Finally, we will focus in two very important results of OrthoFinder: The Single copy orthologs and the Species Tree.

The Single copy orthologs

The Single copy orthologs are homologous and orthologous genes that exist in every single species and only have one copy. These genes are of fundamental importance for molecular evolution, as are genes that can be used to study the impacts of selection or drift across our data set. In our case, the tree of life. We can identify those genes by summarizing the Orthogroups GeneCount file, or by summarizing the number of .fasta files in the Single_Copy_Orthologue_Sequences folder in our OrthoGroups folder.

We have a total of 32 single copy ortholog groups across our data set. If we want to know the function, we can use a very well annotated genome and search for the function.

For example, OG0008354.fa has the A. thaliana gene NP_193902.1. We can go to NBCI and check the annotation of this gene:

As we can see, the NP_193902.1 gene has the annotation associated with DNA-directed RNA polymerase family protein, so we can infer that this function is conserved across the tree of life.

In addition, we can see if this protein sequence has evolved differently across the tree of life. To do so, we can do a multiple sequence alignment of the OG0008354.fa gene using MAFFT

The MSA statistics show that this protein family has 366 (28.5%) identical sites, with a pairwise Identity: of 56.0%. So its quite the conserved gene across the eukaryotic tree of life.

Questions (Section 4 - Single copy orthologs):

What are the other functions of the other single copy OG?
Are the other single copy OG more or less conserved phylogenetically than OG0008354?

The Species tree

The Species Tree is the phylogenetic reconstruction of the species used in this orthology analysis. How is it calculated? According to Emms and Kelly, 2019:

OrthoFinder infers orthologs from the orthogroup trees (a gene tree for the orthogroup) using the steps shown in Fig. 2. Input proteomes are provided by the user using one FASTA file per species. Each file contains the amino acid sequences for the proteins in that species. Orthogroups are inferred using the original OrthoFinder algorithm [10]; an unrooted gene tree is inferred for each orthogroup using DendroBLAST [24]; the unrooted species tree is inferred from this set of unrooted orthogroup trees using the STAG algorithm [33]; this STAG species tree is then rooted using the STRIDE algorithm by identifying high-confidence gene duplication events in the complete set of unrooted orthogroup trees [22]; the rooted species tree is used to root the orthogroup trees; orthologs and gene duplication events are inferred from the rooted orthogroup trees by a novel hybrid algorithm that combines the “species-overlap” method [31] and the duplication-loss-coalescent model [32] (described below); and comparative statistics are calculated.

So, how does the species tree look? We can find it at ~/orthofinder_test/OrthoFinder/Results_Feb06_1/Species_Tree/SpeciesTree_rooted.txt. We can open it on R:

library(ggtree)

## Registered S3 method overwritten by 'treeio':
##   method     from
##   root.phylo ape

## ggtree v1.16.6  For help: https://yulab-smu.github.io/treedata-book/
## 
## If you use ggtree in published research, please cite the most appropriate paper(s):
## 
## �[36m-�[39m Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi: 10.1093/molbev/msy194
## �[36m-�[39m Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628

## 
## Attaching package: 'ggtree'

## The following object is masked from 'package:tidyr':
## 
##     expand

## The following object is masked from 'package:magrittr':
## 
##     inset

species.tree <- read.tree("Results/SpeciesTree_rooted.txt.tre")

ggtree(species.tree, layout = "circular") + geom_tiplab2(size=3) + xlim(0, 1)

Questions (Section 4 - Species tree):

How does the tree look? Does it make sense biologically?

Tabima/BUG_Orthofinder

OrthoFinder tutorial for BUG

Objectives

Data sets

Procedure

Obtaining data

Summary of data

Downloading Orthofinder

Running OrthoFinder

OrthoFinder results

Statistics Overall

Section 1: Overall statistics (Lines 1 to 24)

Questions (Section 1):

Section 2: Genes per Orthogroup (Lines 26-45)

Questions (Section 2):

Section 3: Species per Orthogroup (Lines 47-54)

Questions (Section 3):

Statistics per Species

Section 1: Overall statistics (Lines 1 to 24)

Questions (Section 1 - Numeric results):

Questions (Section 1 - Percentage results):

Section 2. Per species- per ortholog statistics (Lines 14 to 56)

Questions (Section 2 - Percentage results):

The count of genes per Orthogroup predicted by OrthoFinder

Questions (Section 3):

The Single copy orthologs and the Species Tree

The Single copy orthologs

Questions (Section 4 - Single copy orthologs):

The **Species tree **

Questions (Section 4 - Species tree):

The Species tree