Possvm accuracy benchmarks

This is a companion repository to Possvm, and contains scripts and data to reproduce the benchmarking analyses done for the Possvm manuscript (Grau-Bové and Sebé-Pedrós, MBE 2021).

You can find the Possvm source code, manual and installation instructions in the main repository for this project.

Data sources

If you use the datasets provided in this repository, cite the original sources:

HomeoDB database (v2): Zhong et al. Evolution & Development, 2011.
Orthobench repository (v2.0): Emms et al. GBE 2020.

If you use Possvm, please cite the following papers:

Possvm: Grau-Bové and Sebé-Pedrós, bioRxiv 2021.
ETE toolkit: Huerta-Cepas et al. Molecular Biology and Evolution 2016.
Species overlap algorithm: Huerta-Cepas et al. Genome Biolgy 2007.
MCL clustering: Enright et al. Nucleic Acids Research 2002.

Homeobox classification

Test the accuracy of Possvm using manually curated homeobox families from HomeoDB.

From the homeobox-test folder:

Get ANTP sequences from HomeoDB (as of 7th Feb 2021).
Blast (diamond).

# to bilateria+cnidaria dataset
bash s01_get_trees-diamond.sh seed_ANTP.fasta ANTP proteomes/
# UNUSED:
# bash s01_get_trees-diamond.sh seed_PRD.fasta  PRD  proteomes/
# bash s01_get_trees-diamond.sh seed_TALE.fasta TALE proteomes/

# UNUSED:
# to orthobench dataset (17 metazoan species)
# bash s01_get_trees-diamond.sh seed_tale.fasta tale ../orthobench-test/proteomes/
# bash s01_get_trees-diamond.sh seed_ANTP.fasta ANTP ../orthobench-test/proteomes/
# bash s01_get_trees-diamond.sh seed_PRD.fasta PRD ../orthobench-test/proteomes/

Run Possvm.

# base run with iterative rooting
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom -itermidroot 10

# ALTERNATIVE TAXON SAMPLING
# run with iterative rooting, excluding cnidarians from the orthology graph
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_nocni -itermidroot 10 --outgroup outgroups.txt

# ALTERNATIVE CLUSTERING METHODS
# LPA clustering
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_lpa -itermidroot 10 -method lpa
# Louvain clustering
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_lou -itermidroot 10 -method louvain
# Clauset-Newman-Moore greedy modularity maximization
python ../scripts/possvm-greedy.py -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_gre -itermidroot 10
# k-clique
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_kcl -itermidroot 10 -method kclique
# MCL weighted by node supports
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_mclw -itermidroot 10 -method mclw


# ALTERNATIVE ROOTS
# run with midpoint rooting
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_mid

# ALTERNATIVE TREES
# PRD and TALE homeoboxes
possvm -i results_trees/PRD.genes.iqtree.treefile -p PRD.possom -itermidroot 10
possvm -i results_trees/TALE.genes.iqtree.treefile -p TALE.possom -itermidroot 10

# ALTERNATIVE ALL
# base run with species tree reconciliation
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom -itermidroot 10 -spstree taxon_sampling.newick

Evaluate using classification from blast to HomeoDB.

Rscript s02_evaluate_homeodb_master.R

Annotate trees with human genes:

# using iterative rooting
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom -itermidroot 10 -r reference_ANTP.type.csv -refsps Hsap -o results_annotation -printallpairs -min_support_transfer 50
# possvm -i results_trees/PRD.genes.iqtree.treefile -p PRD.possom -itermidroot 10 -r reference_PRD.type.csv -refsps Hsap -o results_annotation -printallpairs -min_support_transfer 50
# possvm -i results_trees/TALE.genes.iqtree.treefile -p TALE.possom -itermidroot 10 -r reference_TALE.type.csv -refsps Hsap -o results_annotation -printallpairs -min_support_transfer 50

# using midpoint rooting
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_mid -r reference_ANTP.type.csv -refsps Hsap -o results_annotation -printallpairs -min_support_transfer 50

Downsample ANTP trees, by downsampling dataset (always include Homo so that we have at least one reference species to evaluate, though):

# UNUSED
# create 20 versions of the original dataset downsampled to 10%, 20%, ... 70% of the species
# Rscript s10_downsample_original_alignment.R
# # launch trees
# for i in results_trees/downsampling_alignment/downsample_0.*.genes.lt.fasta ; do qsub -N tree-$(basename ${i}) -pe smp 4 qsub_iqtree.sh ${i} 4 ; done
# # call OGs
# for i in results_trees/downsampling_alignment/downsample*treefile ; do possvm -i $i -p $(basename ${i}) -itermidroot 10 -skipprint ; done
# # evaluate precision and recall
# Rscript s11_evaluate_downsampling.R

Evaluate effect of tree accuracy in ANTP classification, by permutating tip labels in the original tree

# permutations
Rscript s20_randomise_tips_all.R
# call OGs
for i in results_trees/downsampling/downsample*newick ; do possvm -i $i -p $(basename ${i}) -itermidroot 10 -skipprint ; done
# evaluate precision and recall
Rscript s21_evaluate_permutations_all.R

Alternative methods

BranchClust & HomeoDB

From the homeobox-test folder:

Prepare taxa dictionary files:

mkdir -p results_branchclust/
for i in orthobench_trees/raw/RefOG0*.ortholog_groups.csv ; do awk 'NR>1' $i | cut -f1 -d '_' ; done | sort -u | awk '{ print $1" | " $1"_.*" }' > results_branchclust/gi_numbers.out

Run BranchClust at various possible <MANY> values: 50% of the datset, 60%, 70%, and 80% (to assess tradeoff between precision and recall):

cd results_branchclust/
i=../results_trees/ANTP.genes.iqtree.treefile
# let's define ntax as 60% of the species in tree (recommended values: 50-80%)
for m in 50 60 70 80 ; do
ntax=$(cat ${i} | tr ',' '\n' | tr -d '()' |grep "^[A-Z]" | cut -f1 -d '_'| sort -u| wc -l | awk 'n=($1 * "'$m'" / 100) { print int(n) }')
perl ../../scripts/BranchClust_1.00.pl ${i} ${ntax}
mv clusters.out ANTP.bc_clusters_${m}.out
mv families.list ANTP.bc_families_${m}.out
grep -A 1 ' CLUSTER ' ANTP.bc_clusters_${m}.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\tOG."NR }}' > ANTP.bc_clusters_${m}.csv
done

Test BranchClust:

# evaluated with s02_evaluate_homeodb_master.R

Phylome-style & HomeoDB

From the homeobox-test folder:

OGs based on connected components starting from a focus species (human), using Orthobench trees:

i=results_trees/ANTP.genes.iqtree.treefile
python ../scripts/phylome-ccs-focus-Hsap.py -i $i -p  $(basename ${i%%.*}).ccsh -ogprefix ccsh
python ../scripts/phylome-ccs-focus-Dmel.py -i $i -p  $(basename ${i%%.*}).ccsd -ogprefix ccsd
# evaluated with s02_evaluate_homeodb_master.R

Orthobench

Test the accuracy of Possvm using manually curated orthologs from the Orthobench 2.0 repository (see Emms et al. GBE 2020).

From the orthobench-test folder:

Get Orthobench reference groups (refOG) from the Open_Orthobench Github; total: 70 RefOGs), as well as raw and tight trees (ignore intermediate trees).

# get Orthobench repository
git clone git@github.com:davidemms/Open_Orthobench.git

# refOGs assignments:
ls <path>/Open_Orthobench/BENCHMARKS/RefOGs/

# tight and raw trees:
ls <path>/Open_Orthobench/Supporting_Data/Data_for_RefOGs/trees
ls <path>/Open_Orthobench/Supporting_Data/Data_for_RefOGs/trees_tight

# # HMM profiles in this folder
# ls <path>/Open_Orthobench/Supporting_Data/Additional_Files/hmm_profiles/
# # copy the HMM files into the `hmm_profiles_strict` and `hmm_profiles_weak` folders in the present directory

NOTE: Fix duplicated assignment of FBpp0309618 to the RefOG021 and RefOG068 clusters (remove from latter), format into refOGs.csv.
Get primary transcripts from 17 metazoan species from the Orthobench repository, and HMM profiles for each refOG:

# primary transcripts for each species, in this folder:
ls <path>/Open_Orthobench/Supporting_Data/Additional_Files/proteomes/primary_transcripts
# copy the fasta files into the `proteomes` folder in in the present directory and concatenate them...

Run Possvm as follows:

# find OGs in each tree with possom, using the original trees from orthobench:
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom -ogprefix "$(basename ${i%%.*})." ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_iter -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_mclw -ogprefix "$(basename ${i%%.*})." -method mclw    ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_lpa -ogprefix "$(basename ${i%%.*})." -method lpa      ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_lou -ogprefix "$(basename ${i%%.*})." -method louvain  ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_skr -ogprefix "$(basename ${i%%.*})." -skiproot ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_spst -ogprefix "$(basename ${i%%.*})." -spstree species_tree.newick  ; done

# UNUSED
# for i in orthobench_trees/tight/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom -ogprefix "$(basename ${i%%.*})." ; done # UNUSED

Calculate accuracy relative to refOGs.csv:

Rscript s02_evaluate_orthobench_master.R

Run homology searches, MSAs and new phylogenies:

# # run from the present directory
# bash s01_get_trees-diamond.sh

# # find OGs in each tree with possom, using new expanded trees (necessary to assess the effect of iterative rooting?)
# cp results_trees/*.treefile results_orthology/
# for i in results_orthology/*.treefile ; do possvm -i $i -p  $(basename ${i%%.*}).possom -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_orthology/*.treefile ; do possvm -i $i -p $(basename ${i%%.*}).possom_iter -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done

# # evaluate iterative rooting accuracy:
# # DEPRECATED
# Rscript s03_evaluate_orthobench_one2one_iter.R
# Rscript s04_compare_midroot_iterroot.R

Effect of iterative rooting

Based on raw Orthobench trees, let's increase the length of random internal branches to evaluate the effect of the iterative tree rooting procedure:

Create collection of inflated trees (20 per original tree):

Rscript s05_root_test_inflate_collection_v2.R

Run Possvm with and without iterative rooting:

# range of inflation values
for i in results_rooting_inflated_trees/RefOG0*.tre ; do possvm -i $i -p  $(basename ${i%%.tre}).possom_mid -ogprefix "$(basename ${i%%.*})." --skipprint; done
for i in results_rooting_inflated_trees/RefOG0*.tre ; do possvm -i $i -p  $(basename ${i%%.tre}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 --skipprint ; done

# DEPRECATED:
# inflation in one branch
# for i in results_rooting_inflation_one/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_mid -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_rooting_inflation_one/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
# # low inflation (bound between 5x and 20x, for 5% of edges)
# for i in results_rooting_inflation/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_mid -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_rooting_inflation/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
# # high inflation (bound between 5x and 50x, for 10% of edges)
# for i in results_rooting_inflation_high/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_mid -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_rooting_inflation_high/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
# # higher inflation (bound between 5x and 50x, for 20% of edges)
# for i in results_rooting_inflation_higher/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_mid -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_rooting_inflation_higher/*.tre ; do possvm -i $i -p  $(basename ${i%%.*}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done

Evaluate difference:

Rscript s06_root_test_evaluate_iterroot.R

Alternative methods

BranchClust in Orthobench

From the orthobench-test folder:

Prepare taxa dictionary files:

mkdir -p results_branchclust/
for i in orthobench_trees/raw/RefOG0*.ortholog_groups.csv ; do awk 'NR>1' $i | cut -f1 -d '_' ; done | sort -u | awk '{ print $1" | " $1"_.*" }' > results_branchclust/gi_numbers.out

Run BranchClust at various possible <MANY> values: 50% of the datset, 60%, 70%, and 80% (to assess tradeoff between precision and recall):

cd results_branchclust/
for i in $(cut -f1 ../refOGs.csv | sort -u ) ; do
# let's define ntax as 60% of the species in tree (recommended values: 50-80%)
# ntax=$(cat ../orthobench_trees/raw/${i}.tre| tr ',' '\n' | tr -d '()' |grep "^[A-Z]" | cut -f1 -d '_'| sort -u| wc -l | awk 'n=($1 * 0.6) { print int(n) }')
# 50%
perl ../../scripts/BranchClust_1.00.pl ../orthobench_trees/raw/${i}.tre 9
mv clusters.out ${i}.bc_clusters_50.out
mv families.list ${i}.bc_families_50.out
grep -A 1 ' CLUSTER ' ${i}.bc_clusters_50.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\t'${i}'."NR }}' > ${i}.bc_clusters_50.csv
# 60%
perl ../../scripts/BranchClust_1.00.pl ../orthobench_trees/raw/${i}.tre 10
mv clusters.out ${i}.bc_clusters_60.out
mv families.list ${i}.bc_families_60.out
grep -A 1 ' CLUSTER ' ${i}.bc_clusters_60.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\t'${i}'."NR }}' > ${i}.bc_clusters_60.csv
# 70%
perl ../../scripts/BranchClust_1.00.pl ../orthobench_trees/raw/${i}.tre 12
mv clusters.out ${i}.bc_clusters_70.out
mv families.list ${i}.bc_families_70.out
grep -A 1 ' CLUSTER ' ${i}.bc_clusters_70.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\t'${i}'."NR }}' > ${i}.bc_clusters_70.csv
# 80%
perl ../../scripts/BranchClust_1.00.pl ../orthobench_trees/raw/${i}.tre 14
mv clusters.out ${i}.bc_clusters_80.out
mv families.list ${i}.bc_families_80.out
grep -A 1 ' CLUSTER ' ${i}.bc_clusters_80.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\t'${i}'."NR }}' > ${i}.bc_clusters_80.csv
done

Test BranchClust:

# within s02_evaluate_orthobench_master.R

Phylome-style in Orthobench

From the orthobench-test folder:

OGs based on connected components starting from a focus species (human), using Orthobench trees:

for i in orthobench_trees/raw/*.tre ; do python ../scripts/phylome-ccs-focus-Hsap.py -i $i -p  $(basename ${i%%.*}).ccsh -ogprefix "$(basename ${i%%.*})." ; done
for i in orthobench_trees/raw/*.tre ; do python ../scripts/phylome-ccs-focus-Dmel.py -i $i -p  $(basename ${i%%.*}).ccsd -ogprefix "$(basename ${i%%.*})." ; done
# within s02_evaluate_orthobench_master.R

Deprecated alternatives:

OGs based on the list of genes descending from a given duplication node (absurd):

# UNUSED
# for i in orthobench_trees/raw/*.tre ; do python ../scripts/phylome-dups.py -i $i -p  $(basename ${i%%.*}).dups -ogprefix "$(basename ${i%%.*})." ; done

OGs based on connected components, using PhylomeDB trees from a focus species (human) — this seems fair, but it's not a great idea: many reasons for sequence dropout may artifically reduce recall.

# DEPRECATED
# # get data from http://www.phylomedb.org/phylome_514 (meta human phylome with 66 species, including 15 species also present in Orthobench)
# cat ../phylome-data/phylome-0514-human/proteomes/*.fa > ../phylome-data/phylome-0514-human/proteomes_phylome.fasta
# diamond makedb --in ../phylome-data/phylome-0514-human/proteomes_phylome.fasta -d ../phylome-data/phylome-0514-human/proteomes_phylome.fasta

# # find matching sequences in phylomeDB
# mkdir -p results_phylomedb/
# for i in $(cut -f1 refOGs.csv | sort -u ) ; do
# diamond blastp --more-sensitive -d ../phylome-data/phylome-0514-human/proteomes_phylome.fasta -q results_searches/${i}.seed.fasta -o results_phylomedb/${i}.diamond_phylome.csv -p 4
# done

# # do it with all seqs
# diamond blastp -q ../phylome-data/phylome-0514-human/proteomes_phylome.fasta -d proteomes/all_proteomes.fa -o results_phylomedb/dictionary_diamond_phylome.csv -p 10
# awk '$3 == 100 { print $1,$2 }' results_phylomedb/dictionary_diamond_phylome.csv | tr ' ' '\t' > results_phylomedb/dictionary_diamond_phylome_filt.csv

# # get ortholog pairs from phylomeDB
# for i in $(cut -f1 refOGs.csv | sort -u ) ; do
# fgrep -f <(grep -w ${i} refOGs.csv | cut -f2) results_phylomedb/dictionary_diamond_phylome_filt.csv | grep "_HUMAN" | cut -f1 | sort -u > results_phylomedb/${i}.diamond_phylome.human.txt
# fgrep -f results_phylomedb/${i}.diamond_phylome.human.txt ../phylome-data/phylome-0514-human/orthologs.txt > results_phylomedb/${i}.diamond_phylome.human_orthologs.csv
# done

#

Setup of alternative species-overlap methods

Setup of alternative methods.

BranchClust

Download:

wget http://www.bioinformatics.org/branchclust/BranchClust_1-00.txt && mv BranchClust_1-00.txt ../scripts/BranchClust_1.00.pl
wget http://www.bioinformatics.org/branchclust/BranchClust_Tutorial.tgz && mv BranchClust_Tutorial.tgz ../scripts/
tar xvfz ../scripts/BranchClust_Tutorial.tgz && mv BranchClust_Tutorial/ ../scripts
wget http://bioinformatics.org/branchclust/BranchClust_all.tgz && tar xvfz BranchClust_all.tgz && mv BranchClust_all.tgz BranchClust_all ../scripts

Phylome-style

We'll define orthogroups from a tree using naive approaches based on species overlap, and compare these orthogroups with those defined by Possvm.

xgrau/possvm-orthology-benchmarking

Possvm accuracy benchmarks

Data sources

Homeobox classification

Alternative methods

BranchClust & HomeoDB

Phylome-style & HomeoDB

Orthobench

Effect of iterative rooting

Alternative methods

BranchClust in Orthobench

Phylome-style in Orthobench

Setup of alternative species-overlap methods

BranchClust

Phylome-style