/BUG_Orthofinder

A repo for Orthofinder in CGRB BUG

OrthoFinder tutorial for BUG

Javier F. Tabima February 12, 2020


Objectives

Assessments of homology using genomic data allow for the identification of genes under selection, expansion or reduction of genic families, analysis of presence or absence of important genes, and identification of core genes across different taxonomic levels. These assays based on orthologous gene searches come in different flavors, and many tools have been created to identify homologous genes across different genomes (e.g. OrthoMCL, InParanoid, FastOrtho, etc).

In this tutorial we will focus on OrthoFinder (Emms et al. 2015; https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0721-2), a fast and simple software focused on phylogenetic orthology inference. We will focus on the basics of this program, how to run it in the cluster, and how to interpret results to identify core homologous genes, genic duplication and how OrthoFinder reconstructs species trees based on orthologous gene data.


Data sets

OrthoFinder uses protein sequences in FASTA format and it performs optimally on protein sequences predicted from whole genome sequences. For this tutorial we want to explore orthology across the Eukaryotic tree of life. We will use six different species across the tree of life:

  • Canis lupus (Dog)
  • Amanita muscaria (Fly agaric)
  • Phytophthora infestans (Late blight of potato and tomato)
  • Arabidopsis thaliana (thale cress)
  • Dyctiostelium discoideum (Cellular slime mold)
  • Paramecium tetraurelia

Procedure

Obtaining data

  1. Download all of the samples to a folder in the cluster. The folder can be any folder that you have access to. In my case, I’ll download everything in my home/ folder
cd ~
mkdir orthofinder_test
cd orthofinder_test
  1. To download these files into the structure you can either download them all separately or one by one. Here are the commands I use:
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1/GCF_000002285.3_CanFam3.1_protein.faa.gz > Canis_lupus.fasta.gz' -r curl_1
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/827/485/GCA_000827485.1_Amanita_muscaria_Koide_BX008_v1.0/GCA_000827485.1_Amanita_muscaria_Koide_BX008_v1.0_protein.faa.gz > Amanita_muscaria.fasta.gz' -r curl_2
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/142/945/GCF_000142945.1_ASM14294v1/GCF_000142945.1_ASM14294v1_protein.faa.gz  > Phytophthora_infestans.fasta.gz' -r curl_3
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_protein.faa.gz > Arabidopsis_thaliana.fasta.gz' -r curl_4
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/004/695/GCF_000004695.1_dicty_2.7/GCF_000004695.1_dicty_2.7_protein.faa.gz > Dyctiostelium_discoideum.fasta.gz' -r curl_5
SGE_Batch -c 'curl ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/165/425/GCF_000165425.1_ASM16542v1/GCF_000165425.1_ASM16542v1_protein.faa.gz > Paramecium_tetraurelia.fasta.gz' -r curl_6
  1. Decompress the fasta files
SGE_Batch -c 'bash; for i in *.gz; do gunzip $i; done' -r gunzip
  1. Remove all the unnecessary files
rm -rf curl_*/
rm -rf gunzip/

Summary of data

To estimate the number of proteins per downloaded genome we can count the number of sequences in each of the FASTA files we downloaded. To do so you can count the number of > characters per file. This is how I would do it:

for i in *.fasta; do a=$(grep -c ">" $i); printf $i"\t"$a"\n"; done

The results should look like this:

FASTA file Number of proteins
Amanita_muscaria.fasta 18,093
Arabidopsis_thaliana.fasta 48,265
Canis_lupus.fasta 58,774
Dyctiostelium_discoideum.fasta 13,315
Paramecium_tetraurelia.fasta 39,580
Phytophthora_infestans.fasta 17,797

Downloading Orthofinder

  • Download the latest OrthoFinder.tar.gz release from github:
cd ~
wget https://github.com/davidemms/OrthoFinder/releases/latest/download/OrthoFinder.tar.gz
  • Extract the files:
tar xzf OrthoFinder.tar.gz
  • Test OrthoFinder:
cd OrthoFinder/
./orthofinder -h

If OrthoFinder works correctly, OrthoFinder should print its ‘help’ text.

Running OrthoFinder

For this example, we will run OrthoFinder using the default settings. One of the greatest advantages of OrthoFinder is that the only input it requires is the folder where the FASTA sequences are. In our case, the FASTA aminoacid files are contained in the ~/orthofinder_test folder. Make sure you have the absolute path of your folder. In my example, the path is /raid1/home/bpp/tabimaj/orthofinder_test

So, to run OrthoFinder while in the ~/OrthoFinder/ folder, you can run the analysis using the following command by requesting 6 processors/slots from the CGRB cluster:

SGE_Batch -c './orthofinder -f /raid1/home/bpp/tabimaj/orthofinder_test -t 8' -P 8 -r BUG_orthofinder 

OrthoFinder results

The run of OrthoFinder took 2h 20mins using eight threads at the CGRB infrastructure. The host which was OrthoFinder ran at was symbiosis.

The results of your personal OrthoFinder run should be found in the ~/orthofinder_test/Results_Feb12/ folder. For an example run, I have a folder that will CGRB users read the files from an identical OrthoFinder folder at /raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1

While I recommend you to go to the OrthoFinder webpage to understand what files and results come as output from the program, we will focus in four main results today:

  • The Statistics Overall file
  • The Statistics Per Species file
  • The count of genes per Orthogroup predicted by OrthoFinder
  • The Single copy orthologs and the Species Tree

Statistics Overall

Location: /raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Comparative_Genomics_Statistics/Statistics_Overall.tsv

The Statistics Overall file is a tab-separated text file that contains general statistics about orthogroup sizes and proportion of genes assigned to orthogroups.

As mentioned by the OrthoFinder webpage:

Most of the terms in the files ‘Statistics_Overall.csv’ and ‘Statistics_PerSpecies.csv’ are self-explanatory, the remainder are defined below.

Species-specific orthogroup: An orthogroups that consist entirely of genes from one species.

G50: The number of genes in the orthogroup such that 50% of genes are in orthogroups of that size or larger.

O50: The smallest number of orthogroups such that 50% of genes are in orthogroups of that size or larger.

Single-copy orthogroup: An orthogroup with exactly one gene (and no more) from each species. These orthogroups are ideal for inferring a species tree and many other analyses.

Unassigned gene: A gene that has not been put into an orthogroup with any other genes.

Lets take a look at the file. Even if this is a “tab-separated” file, we need to split it on sections:

Section 1: Overall statistics (Lines 1 to 24)

library(magrittr)
library(knitr)
library(kableExtra)

knitr::kable(read.table("Results/Statistics_Overall.tsv",fill = T, nrows = 24, sep = "\t")) %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

V1

V2

Number of species

6

Number of genes

195824

Number of genes in orthogroups

172946

Number of unassigned genes

22878

Percentage of genes in orthogroups

88.3

Percentage of unassigned genes

11.7

Number of orthogroups

25731

Number of species-specific orthogroups

19632

Number of genes in species-specific orthogroups

111223

Percentage of genes in species-specific orthogroups

56.8

Mean orthogroup size

6.7

Median orthogroup size

4.0

G50 (assigned genes)

10

G50 (all genes)

9

O50 (assigned genes)

4442

O50 (all genes)

5664

Number of orthogroups with all species present

1319

Number of single-copy orthogroups

32

Date

2020-02-06

Orthogroups file

Orthogroups.tsv

Unassigned genes file

Orthogroups_UnassignedGenes.tsv

Per-species statistics

Statistics_PerSpecies.tsv

Overall statistics

Statistics_Overall.tsv

Orthogroups shared between species

Orthogroups_SpeciesOverlaps.tsv

Questions (Section 1):
  1. What is the number of orthogroups that consist entirely of genes from one species? Is it the majority of data? What percentage of the orthogroups is is?

  2. What is the number of single-copy orthogroups found in the analysis? What does this mean biologically?

Section 2: Genes per Orthogroup (Lines 26-45)

genes_per_OG <- read.delim("Results/Statistics_Overall.tsv",fill = T, skip = 25, nrows = 20, sep = "\t")
kable(genes_per_OG) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Average.number.of.genes.per.species.in.orthogroup

Number.of.orthogroups

Percentage.of.orthogroups

Number.of.genes

Percentage.of.genes

<1

16108

62.6

47651

27.6

’1

6015

23.4

47718

27.6

’2

1887

7.3

26446

15.3

’3

737

2.9

14887

8.6

’4

399

1.6

10460

6.0

’5

236

0.9

7651

4.4

’6

130

0.5

4987

2.9

’7

70

0.3

3103

1.8

’8

45

0.2

2268

1.3

’9

28

0.1

1569

0.9

’10

21

0.1

1297

0.7

11-15

44

0.2

3353

1.9

16-20

8

0.0

880

0.5

21-50

3

0.0

676

0.4

51-100

0

0.0

0

0.0

101-150

0

0.0

0

0.0

151-200

0

0.0

0

0.0

201-500

0

0.0

0

0.0

501-1000

0

0.0

0

0.0

’1001+

0

0.0

0

0.0

I think its easier to see these results on a plot than in the table

library(reshape2)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::extract()    masks magrittr::extract()
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
## x purrr::set_names()  masks magrittr::set_names()
genes_per_OG.m <- melt(genes_per_OG, id.vars = "Average.number.of.genes.per.species.in.orthogroup", factorsAsStrings = T)
genes_per_OG.m$Average.number.of.genes.per.species.in.orthogroup <- factor(genes_per_OG.m$Average.number.of.genes.per.species.in.orthogroup,levels = as.character(genes_per_OG$Average.number.of.genes.per.species.in.orthogroup))

ggplot(genes_per_OG.m, aes(x=Average.number.of.genes.per.species.in.orthogroup , y=value, fill=variable)) + geom_bar(stat="identity") + facet_grid(variable~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none") 

Questions (Section 2):
  1. What is the overall pattern of the results we find?

Section 3: Species per Orthogroup (Lines 47-54)

species_per_OG <- read.delim("Results/Statistics_Overall.tsv",fill = T, skip = 46, nrows = 7, sep = "\t", header = T)
kable(species_per_OG) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Number.of.species.in.orthogroup

Number.of.orthogroups

1

19632

2

2027

3

1128

4

774

5

851

6

1319

ggplot(species_per_OG, aes(x=as.factor(Number.of.species.in.orthogroup), y=Number.of.orthogroups, fill=as.factor(Number.of.species.in.orthogroup))) + geom_bar(stat="identity") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none")  + xlab("Number of species on orthogroup") + ylab("Number of orthogroups")

Questions (Section 3):
  1. The highest number of orthogroups contain only one species. What are the biological implications of this result?

  2. If OrthoFinder was used in a subset of highly-similar species, would this result be similar or would it change? How so?


Statistics per Species

Location: /raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Comparative_Genomics_Statistics/Statistics_PerSpecies.tsv

The Statistics per Species file is a tab-separated text file that contains general statistics about the number of genes that are within each category of orthogroups for each species. As with the Statistics Overall table, we have to divide it into different sections:

Section 1: Overall statistics (Lines 1 to 24)

overall_stats <- read.table("Results/Statistics_PerSpecies.tsv",fill = T, nrows = 10, sep = "\t", header = T)
species_names <- colnames(overall_stats)[-1]

knitr::kable(overall_stats) %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

X

Amanita_muscaria

Arabidopsis_thaliana

Canis_lupus

Dyctiostelium_discoideum

Paramecium_tetraurelia

Phytophthora_infestans

Number of genes

18093.0

48265.0

58774.0

13315.0

39580.0

17797.0

Number of genes in orthogroups

12895.0

44825.0

56239.0

9889.0

35093.0

14005.0

Number of unassigned genes

5198.0

3440.0

2535.0

3426.0

4487.0

3792.0

Percentage of genes in orthogroups

71.3

92.9

95.7

74.3

88.7

78.7

Percentage of unassigned genes

28.7

7.1

4.3

25.7

11.3

21.3

Number of orthogroups containing species

4719.0

8553.0

9321.0

4723.0

9480.0

5539.0

Percentage of orthogroups containing species

18.3

33.2

36.2

18.4

36.8

21.5

Number of species-specific orthogroups

1310.0

4665.0

4930.0

769.0

6636.0

1322.0

Number of genes in species-specific orthogroups

7616.0

28905.0

36927.0

4152.0

26548.0

7075.0

Percentage of genes in species-specific orthogroups

42.1

59.9

62.8

31.2

67.1

39.8

As these results are very numerous, lets graph them. We will do a subset of the non-percentage data first:

overall_stats.number <- overall_stats[grep(x = overall_stats$X, pattern = "Number"),]
overall_stats.number <- melt(overall_stats.number)
## Using X as id variables
overall_stats.number$X <- gsub(overall_stats.number$X, pattern = "Number of ", replacement = "") %>% tools::toTitleCase()

ggplot(overall_stats.number, aes(x=variable , y=value, fill=X)) + geom_bar(stat="identity") + facet_grid(X~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none") + ylab("Number")

Questions (Section 1 - Numeric results):
  1. What is the species with the most and the least number of genes used in the analysis?

  2. What does the Species specific orthologs category mean? Is this representing paralogy?

  3. Does this table/plot accurately represent the contribution of each species to the orthology assay?


While the numeric results provide us with an overall assay of the general results of the orthologous assay, the data formatted in percentages may prove a more standardized vision of the results:

overall_stats.percentage <- overall_stats[grep(x = overall_stats$X, pattern = "Number", invert = T),]
overall_stats.percentage <- melt(overall_stats.percentage)
## Using X as id variables
overall_stats.percentage$X <- gsub(overall_stats.percentage$X, pattern = "Percentage of ", replacement = "") %>% tools::toTitleCase()

ggplot(overall_stats.percentage, aes(x=variable , y=value, fill=X)) + geom_bar(stat="identity") + facet_grid(X~., scales = "free_y") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "none") + ylab("Percentage (%)")

Questions (Section 1 - Percentage results):
  1. What is the species with the most shared genes across the entire data set?

Section 2. Per species- per ortholog statistics (Lines 14 to 56)

The second section of the Statistics per Species file includes all the statistics of the number and percentage of orthologs that contain data for each species. For today we will focus only on the percentages (Lines 35 - 56):

perecentage_stats <- read.delim("Results/Statistics_PerSpecies.tsv",fill = T, skip = 35, nrows = 20, sep = "\t", header = T)
colnames(perecentage_stats)[-1] <- species_names

knitr::kable(perecentage_stats) %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Number.of.genes.per.species.in.orthogroup

Amanita_muscaria

Arabidopsis_thaliana

Canis_lupus

Dyctiostelium_discoideum

Paramecium_tetraurelia

Phytophthora_infestans

’0

81.7

66.8

63.8

81.6

63.2

78.5

’1

10.7

4.1

5.1

12.5

3.1

12.1

’2

3.6

8.9

6.8

3.1

16.0

4.7

’3

1.4

5.7

5.0

1.1

6.1

1.7

’4

0.7

3.7

3.7

0.5

4.4

0.9

’5

0.4

2.4

2.9

0.3

2.0

0.5

’6

0.3

1.8

2.1

0.2

1.5

0.3

’7

0.2

1.3

1.8

0.2

0.9

0.3

’8

0.1

0.9

1.4

0.1

0.6

0.2

’9

0.1

0.7

1.1

0.1

0.5

0.2

’10

0.1

0.5

0.9

0.1

0.2

0.1

11-15

0.3

1.5

2.5

0.2

0.7

0.3

16-20

0.1

0.7

1.1

0.1

0.3

0.2

21-50

0.2

1.1

1.5

0.1

0.4

0.2

51-100

0.0

0.1

0.1

0.0

0.1

0.0

101-150

0.0

0.0

0.0

0.0

0.0

0.0

151-200

0.0

0.0

0.0

0.0

0.0

0.0

201-500

0.0

0.0

0.0

0.0

0.0

0.0

501-1000

0.0

0.0

0.0

0.0

0.0

0.0

’1001+

0.0

0.0

0.0

0.0

0.0

0.0

perecentage_stats$Number.of.genes.per.species.in.orthogroup <- factor(perecentage_stats$Number.of.genes.per.species.in.orthogroup, levels = perecentage_stats$Number.of.genes.per.species.in.orthogroup)

perecentage_stats <- melt(perecentage_stats)
## Using Number.of.genes.per.species.in.orthogroup as id variables
ggplot(perecentage_stats, aes(x=Number.of.genes.per.species.in.orthogroup , y=value, fill=variable)) + geom_bar(stat="identity", position = "dodge") + scale_fill_viridis_d() + theme_bw() + theme(legend.position = "bottom") + ylab("Percentage (%)")

Questions (Section 2 - Percentage results):
  1. Why are these results expected?

The count of genes per Orthogroup predicted by OrthoFinder

Location: /raid1/home/bpp/tabimaj/orthofinder_test/OrthoFinder/Results_Feb06_1/Orthogroups/Orthogroups.GeneCount.tsv

The Statistics per Species and Statistics Overall files provided us with a general view of the contribution of each genome and each orthogroup into our orthology analysis. For a more detailed look into our results, we should look at the Orthogroups GeneCount file This file includes the number of genes per species that are grouped inside each predicted orthogroup (OG).

Lets take a look at the first 5 rows of the file:

og.gene_count <- read.table("Results/Orthogroups.GeneCount.tsv", header = T)

head(og.gene_count) %>% knitr::kable() %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Orthogroup

Amanita_muscaria

Arabidopsis_thaliana

Canis_lupus

Dyctiostelium_discoideum

Paramecium_tetraurelia

Phytophthora_infestans

Total

OG0000000

238

0

0

0

0

0

238

OG0000001

1

234

0

0

0

1

236

OG0000002

200

0

0

0

0

2

202

OG0000003

2

24

0

15

0

84

125

OG0000004

2

46

35

10

5

20

118

OG0000005

0

117

0

0

0

0

117

The results of the Orthogroups GeneCount file are sorted by number of total genes per orthogroup. These results show that the contributions of each species into each orthogroup are quite different. For OG0000000 we see all genes (238 genes) are contributed by Amanita muscaria. This is a Single species orthogroup.

Conversely, OG0000003, OG0000004 and OG0000005 are comprised by many genes of different species. So these are Shared species orthogroups.

We can observe the general pattern of the orthogroups by plotting a heat map for the first 200 orthogroups without the Total column

library(vcfR)
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.9.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
OG.mtrix <- as.matrix(og.gene_count[c(1:200),-c(1,8)])
rownames(OG.mtrix) <- og.gene_count[c(1:200),1]
OG.mtrix[OG.mtrix == 0] <- NA
heatmap.bp(OG.mtrix)

Several things can be concluded from that figure. First, that the distribution of species that contribute in each OG is patchy. Second, that for the first 200 OG, the species that contribute the most genes are A. thaliana and C. lupus. Third, that there are some OG that have genes in every species. When we review the results of the Statistics Overall table, we remember that 1319 OG have all species present. We should be able to see the contributions of each species per OG by removing all OG with values of 0 and counting the remaining OG:

non_zero_OG <- og.gene_count[apply(og.gene_count[,-1], 1, function (x) sum(x == 0) == 0),]
nrow(non_zero_OG)
## [1] 1319

Woohoo! Lets see the first 10 rows of the OG with all species:

head(non_zero_OG, n=10) %>% knitr::kable() %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Orthogroup

Amanita_muscaria

Arabidopsis_thaliana

Canis_lupus

Dyctiostelium_discoideum

Paramecium_tetraurelia

Phytophthora_infestans

Total

5

OG0000004

2

46

35

10

5

20

118

11

OG0000010

6

7

35

8

29

15

100

18

OG0000017

8

9

10

10

32

14

83

24

OG0000023

10

18

19

13

7

13

80

25

OG0000024

24

18

20

14

2

1

79

27

OG0000026

8

19

24

6

16

5

78

35

OG0000034

3

12

20

3

21

15

74

42

OG0000041

5

20

31

2

11

3

72

44

OG0000043

4

7

34

7

11

7

70

45

OG0000044

2

2

28

5

19

14

70

What about the last 10 rows?

tail(non_zero_OG, n=10) %>% knitr::kable() %>%
 kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Orthogroup

Amanita_muscaria

Arabidopsis_thaliana

Canis_lupus

Dyctiostelium_discoideum

Paramecium_tetraurelia

Phytophthora_infestans

Total

8294

OG0008293

1

1

1

1

1

1

6

8304

OG0008303

1

1

1

1

1

1

6

8316

OG0008315

1

1

1

1

1

1

6

8328

OG0008327

1

1

1

1

1

1

6

8330

OG0008329

1

1

1

1

1

1

6

8334

OG0008333

1

1

1

1

1

1

6

8338

OG0008337

1

1

1

1

1

1

6

8342

OG0008341

1

1

1

1

1

1

6

8344

OG0008343

1

1

1

1

1

1

6

8355

OG0008354

1

1

1

1

1

1

6

Questions (Section 3):

  1. What is the biological importance of these OG that have genes in every species?

  2. What are the last 10 rows showing us? Why are these OG important?


The Single copy orthologs and the Species Tree

Finally, we will focus in two very important results of OrthoFinder: The Single copy orthologs and the Species Tree.

The Single copy orthologs

The Single copy orthologs are homologous and orthologous genes that exist in every single species and only have one copy. These genes are of fundamental importance for molecular evolution, as are genes that can be used to study the impacts of selection or drift across our data set. In our case, the tree of life. We can identify those genes by summarizing the Orthogroups GeneCount file, or by summarizing the number of .fasta files in the Single_Copy_Orthologue_Sequences folder in our OrthoGroups folder.

We have a total of 32 single copy ortholog groups across our data set. If we want to know the function, we can use a very well annotated genome and search for the function.

For example, OG0008354.fa has the A. thaliana gene NP_193902.1. We can go to NBCI and check the annotation of this gene:

Annotation associated to NP_193902.1 from A. thaliana

As we can see, the NP_193902.1 gene has the annotation associated with DNA-directed RNA polymerase family protein, so we can infer that this function is conserved across the tree of life.

In addition, we can see if this protein sequence has evolved differently across the tree of life. To do so, we can do a multiple sequence alignment of the OG0008354.fa gene using MAFFT

Multiple sequence alignment associated to OG0008354*

The MSA statistics show that this protein family has 366 (28.5%) identical sites, with a pairwise Identity: of 56.0%. So its quite the conserved gene across the eukaryotic tree of life.

Questions (Section 4 - Single copy orthologs):

  1. What are the other functions of the other single copy OG?

  2. Are the other single copy OG more or less conserved phylogenetically than OG0008354?


The **Species tree **

The Species Tree is the phylogenetic reconstruction of the species used in this orthology analysis. How is it calculated? According to Emms and Kelly, 2019:

OrthoFinder infers orthologs from the orthogroup trees (a gene tree for the orthogroup) using the steps shown in Fig. 2. Input proteomes are provided by the user using one FASTA file per species. Each file contains the amino acid sequences for the proteins in that species. Orthogroups are inferred using the original OrthoFinder algorithm [10]; an unrooted gene tree is inferred for each orthogroup using DendroBLAST [24]; the unrooted species tree is inferred from this set of unrooted orthogroup trees using the STAG algorithm [33]; this STAG species tree is then rooted using the STRIDE algorithm by identifying high-confidence gene duplication events in the complete set of unrooted orthogroup trees [22]; the rooted species tree is used to root the orthogroup trees; orthologs and gene duplication events are inferred from the rooted orthogroup trees by a novel hybrid algorithm that combines the “species-overlap” method [31] and the duplication-loss-coalescent model [32] (described below); and comparative statistics are calculated.

So, how does the species tree look? We can find it at ~/orthofinder_test/OrthoFinder/Results_Feb06_1/Species_Tree/SpeciesTree_rooted.txt. We can open it on R:

library(ggtree)
## Registered S3 method overwritten by 'treeio':
##   method     from
##   root.phylo ape

## ggtree v1.16.6  For help: https://yulab-smu.github.io/treedata-book/
## 
## If you use ggtree in published research, please cite the most appropriate paper(s):
## 
## �[36m-�[39m Guangchuang Yu, Tommy Tsan-Yuk Lam, Huachen Zhu, Yi Guan. Two methods for mapping and visualizing associated data on phylogeny using ggtree. Molecular Biology and Evolution 2018, 35(12):3041-3043. doi: 10.1093/molbev/msy194
## �[36m-�[39m Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628

## 
## Attaching package: 'ggtree'

## The following object is masked from 'package:tidyr':
## 
##     expand

## The following object is masked from 'package:magrittr':
## 
##     inset
species.tree <- read.tree("Results/SpeciesTree_rooted.txt.tre")

ggtree(species.tree, layout = "circular") + geom_tiplab2(size=3) + xlim(0, 1)

Questions (Section 4 - Species tree):

  1. How does the tree look? Does it make sense biologically?