This is a companion repository to Possvm, and contains scripts and data to reproduce the benchmarking analyses done for the Possvm manuscript (Grau-Bové and Sebé-Pedrós, MBE 2021).
You can find the Possvm source code, manual and installation instructions in the main repository for this project.
If you use the datasets provided in this repository, cite the original sources:
- HomeoDB database (v2): Zhong et al. Evolution & Development, 2011.
- Orthobench repository (v2.0): Emms et al. GBE 2020.
If you use Possvm, please cite the following papers:
- Possvm: Grau-Bové and Sebé-Pedrós, bioRxiv 2021.
- ETE toolkit: Huerta-Cepas et al. Molecular Biology and Evolution 2016.
- Species overlap algorithm: Huerta-Cepas et al. Genome Biolgy 2007.
- MCL clustering: Enright et al. Nucleic Acids Research 2002.
Test the accuracy of Possvm using manually curated homeobox families from HomeoDB.
From the homeobox-test
folder:
-
Get ANTP sequences from HomeoDB (as of 7th Feb 2021).
-
Blast (diamond).
# to bilateria+cnidaria dataset
bash s01_get_trees-diamond.sh seed_ANTP.fasta ANTP proteomes/
# UNUSED:
# bash s01_get_trees-diamond.sh seed_PRD.fasta PRD proteomes/
# bash s01_get_trees-diamond.sh seed_TALE.fasta TALE proteomes/
# UNUSED:
# to orthobench dataset (17 metazoan species)
# bash s01_get_trees-diamond.sh seed_tale.fasta tale ../orthobench-test/proteomes/
# bash s01_get_trees-diamond.sh seed_ANTP.fasta ANTP ../orthobench-test/proteomes/
# bash s01_get_trees-diamond.sh seed_PRD.fasta PRD ../orthobench-test/proteomes/
- Run Possvm.
# base run with iterative rooting
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom -itermidroot 10
# ALTERNATIVE TAXON SAMPLING
# run with iterative rooting, excluding cnidarians from the orthology graph
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_nocni -itermidroot 10 --outgroup outgroups.txt
# ALTERNATIVE CLUSTERING METHODS
# LPA clustering
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_lpa -itermidroot 10 -method lpa
# Louvain clustering
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_lou -itermidroot 10 -method louvain
# Clauset-Newman-Moore greedy modularity maximization
python ../scripts/possvm-greedy.py -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_gre -itermidroot 10
# k-clique
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_kcl -itermidroot 10 -method kclique
# MCL weighted by node supports
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_mclw -itermidroot 10 -method mclw
# ALTERNATIVE ROOTS
# run with midpoint rooting
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_mid
# ALTERNATIVE TREES
# PRD and TALE homeoboxes
possvm -i results_trees/PRD.genes.iqtree.treefile -p PRD.possom -itermidroot 10
possvm -i results_trees/TALE.genes.iqtree.treefile -p TALE.possom -itermidroot 10
# ALTERNATIVE ALL
# base run with species tree reconciliation
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom -itermidroot 10 -spstree taxon_sampling.newick
- Evaluate using classification from blast to HomeoDB.
Rscript s02_evaluate_homeodb_master.R
- Annotate trees with human genes:
# using iterative rooting
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom -itermidroot 10 -r reference_ANTP.type.csv -refsps Hsap -o results_annotation -printallpairs -min_support_transfer 50
# possvm -i results_trees/PRD.genes.iqtree.treefile -p PRD.possom -itermidroot 10 -r reference_PRD.type.csv -refsps Hsap -o results_annotation -printallpairs -min_support_transfer 50
# possvm -i results_trees/TALE.genes.iqtree.treefile -p TALE.possom -itermidroot 10 -r reference_TALE.type.csv -refsps Hsap -o results_annotation -printallpairs -min_support_transfer 50
# using midpoint rooting
possvm -i results_trees/ANTP.genes.iqtree.treefile -p ANTP.possom_mid -r reference_ANTP.type.csv -refsps Hsap -o results_annotation -printallpairs -min_support_transfer 50
- Downsample ANTP trees, by downsampling dataset (always include Homo so that we have at least one reference species to evaluate, though):
# UNUSED
# create 20 versions of the original dataset downsampled to 10%, 20%, ... 70% of the species
# Rscript s10_downsample_original_alignment.R
# # launch trees
# for i in results_trees/downsampling_alignment/downsample_0.*.genes.lt.fasta ; do qsub -N tree-$(basename ${i}) -pe smp 4 qsub_iqtree.sh ${i} 4 ; done
# # call OGs
# for i in results_trees/downsampling_alignment/downsample*treefile ; do possvm -i $i -p $(basename ${i}) -itermidroot 10 -skipprint ; done
# # evaluate precision and recall
# Rscript s11_evaluate_downsampling.R
- Evaluate effect of tree accuracy in ANTP classification, by permutating tip labels in the original tree
# permutations
Rscript s20_randomise_tips_all.R
# call OGs
for i in results_trees/downsampling/downsample*newick ; do possvm -i $i -p $(basename ${i}) -itermidroot 10 -skipprint ; done
# evaluate precision and recall
Rscript s21_evaluate_permutations_all.R
From the homeobox-test
folder:
- Prepare taxa dictionary files:
mkdir -p results_branchclust/
for i in orthobench_trees/raw/RefOG0*.ortholog_groups.csv ; do awk 'NR>1' $i | cut -f1 -d '_' ; done | sort -u | awk '{ print $1" | " $1"_.*" }' > results_branchclust/gi_numbers.out
- Run BranchClust at various possible
<MANY>
values: 50% of the datset, 60%, 70%, and 80% (to assess tradeoff between precision and recall):
cd results_branchclust/
i=../results_trees/ANTP.genes.iqtree.treefile
# let's define ntax as 60% of the species in tree (recommended values: 50-80%)
for m in 50 60 70 80 ; do
ntax=$(cat ${i} | tr ',' '\n' | tr -d '()' |grep "^[A-Z]" | cut -f1 -d '_'| sort -u| wc -l | awk 'n=($1 * "'$m'" / 100) { print int(n) }')
perl ../../scripts/BranchClust_1.00.pl ${i} ${ntax}
mv clusters.out ANTP.bc_clusters_${m}.out
mv families.list ANTP.bc_families_${m}.out
grep -A 1 ' CLUSTER ' ANTP.bc_clusters_${m}.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\tOG."NR }}' > ANTP.bc_clusters_${m}.csv
done
- Test BranchClust:
# evaluated with s02_evaluate_homeodb_master.R
From the homeobox-test
folder:
- OGs based on connected components starting from a focus species (human), using Orthobench trees:
i=results_trees/ANTP.genes.iqtree.treefile
python ../scripts/phylome-ccs-focus-Hsap.py -i $i -p $(basename ${i%%.*}).ccsh -ogprefix ccsh
python ../scripts/phylome-ccs-focus-Dmel.py -i $i -p $(basename ${i%%.*}).ccsd -ogprefix ccsd
# evaluated with s02_evaluate_homeodb_master.R
Test the accuracy of Possvm using manually curated orthologs from the Orthobench 2.0 repository (see Emms et al. GBE 2020).
From the orthobench-test
folder:
- Get Orthobench reference groups (
refOG
) from the Open_Orthobench Github; total: 70 RefOGs), as well as raw and tight trees (ignore intermediate trees).
# get Orthobench repository
git clone git@github.com:davidemms/Open_Orthobench.git
# refOGs assignments:
ls <path>/Open_Orthobench/BENCHMARKS/RefOGs/
# tight and raw trees:
ls <path>/Open_Orthobench/Supporting_Data/Data_for_RefOGs/trees
ls <path>/Open_Orthobench/Supporting_Data/Data_for_RefOGs/trees_tight
# # HMM profiles in this folder
# ls <path>/Open_Orthobench/Supporting_Data/Additional_Files/hmm_profiles/
# # copy the HMM files into the `hmm_profiles_strict` and `hmm_profiles_weak` folders in the present directory
-
NOTE: Fix duplicated assignment of
FBpp0309618
to theRefOG021
andRefOG068
clusters (remove from latter), format intorefOGs.csv
. -
Get primary transcripts from 17 metazoan species from the Orthobench repository, and HMM profiles for each refOG:
# primary transcripts for each species, in this folder:
ls <path>/Open_Orthobench/Supporting_Data/Additional_Files/proteomes/primary_transcripts
# copy the fasta files into the `proteomes` folder in in the present directory and concatenate them...
- Run Possvm as follows:
# find OGs in each tree with possom, using the original trees from orthobench:
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom -ogprefix "$(basename ${i%%.*})." ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_iter -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_mclw -ogprefix "$(basename ${i%%.*})." -method mclw ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_lpa -ogprefix "$(basename ${i%%.*})." -method lpa ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_lou -ogprefix "$(basename ${i%%.*})." -method louvain ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_skr -ogprefix "$(basename ${i%%.*})." -skiproot ; done
for i in orthobench_trees/raw/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_spst -ogprefix "$(basename ${i%%.*})." -spstree species_tree.newick ; done
# UNUSED
# for i in orthobench_trees/tight/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom -ogprefix "$(basename ${i%%.*})." ; done # UNUSED
- Calculate accuracy relative to
refOGs.csv
:
Rscript s02_evaluate_orthobench_master.R
Run homology searches, MSAs and new phylogenies:
# # run from the present directory
# bash s01_get_trees-diamond.sh
# # find OGs in each tree with possom, using new expanded trees (necessary to assess the effect of iterative rooting?)
# cp results_trees/*.treefile results_orthology/
# for i in results_orthology/*.treefile ; do possvm -i $i -p $(basename ${i%%.*}).possom -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_orthology/*.treefile ; do possvm -i $i -p $(basename ${i%%.*}).possom_iter -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
# # evaluate iterative rooting accuracy:
# # DEPRECATED
# Rscript s03_evaluate_orthobench_one2one_iter.R
# Rscript s04_compare_midroot_iterroot.R
Based on raw Orthobench trees, let's increase the length of random internal branches to evaluate the effect of the iterative tree rooting procedure:
- Create collection of inflated trees (20 per original tree):
Rscript s05_root_test_inflate_collection_v2.R
- Run Possvm with and without iterative rooting:
# range of inflation values
for i in results_rooting_inflated_trees/RefOG0*.tre ; do possvm -i $i -p $(basename ${i%%.tre}).possom_mid -ogprefix "$(basename ${i%%.*})." --skipprint; done
for i in results_rooting_inflated_trees/RefOG0*.tre ; do possvm -i $i -p $(basename ${i%%.tre}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 --skipprint ; done
# DEPRECATED:
# inflation in one branch
# for i in results_rooting_inflation_one/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_mid -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_rooting_inflation_one/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
# # low inflation (bound between 5x and 20x, for 5% of edges)
# for i in results_rooting_inflation/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_mid -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_rooting_inflation/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
# # high inflation (bound between 5x and 50x, for 10% of edges)
# for i in results_rooting_inflation_high/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_mid -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_rooting_inflation_high/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
# # higher inflation (bound between 5x and 50x, for 20% of edges)
# for i in results_rooting_inflation_higher/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_mid -ogprefix "$(basename ${i%%.*})." ; done
# for i in results_rooting_inflation_higher/*.tre ; do possvm -i $i -p $(basename ${i%%.*}).possom_ite -ogprefix "$(basename ${i%%.*})." -itermidroot 10 ; done
- Evaluate difference:
Rscript s06_root_test_evaluate_iterroot.R
From the orthobench-test
folder:
- Prepare taxa dictionary files:
mkdir -p results_branchclust/
for i in orthobench_trees/raw/RefOG0*.ortholog_groups.csv ; do awk 'NR>1' $i | cut -f1 -d '_' ; done | sort -u | awk '{ print $1" | " $1"_.*" }' > results_branchclust/gi_numbers.out
- Run BranchClust at various possible
<MANY>
values: 50% of the datset, 60%, 70%, and 80% (to assess tradeoff between precision and recall):
cd results_branchclust/
for i in $(cut -f1 ../refOGs.csv | sort -u ) ; do
# let's define ntax as 60% of the species in tree (recommended values: 50-80%)
# ntax=$(cat ../orthobench_trees/raw/${i}.tre| tr ',' '\n' | tr -d '()' |grep "^[A-Z]" | cut -f1 -d '_'| sort -u| wc -l | awk 'n=($1 * 0.6) { print int(n) }')
# 50%
perl ../../scripts/BranchClust_1.00.pl ../orthobench_trees/raw/${i}.tre 9
mv clusters.out ${i}.bc_clusters_50.out
mv families.list ${i}.bc_families_50.out
grep -A 1 ' CLUSTER ' ${i}.bc_clusters_50.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\t'${i}'."NR }}' > ${i}.bc_clusters_50.csv
# 60%
perl ../../scripts/BranchClust_1.00.pl ../orthobench_trees/raw/${i}.tre 10
mv clusters.out ${i}.bc_clusters_60.out
mv families.list ${i}.bc_families_60.out
grep -A 1 ' CLUSTER ' ${i}.bc_clusters_60.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\t'${i}'."NR }}' > ${i}.bc_clusters_60.csv
# 70%
perl ../../scripts/BranchClust_1.00.pl ../orthobench_trees/raw/${i}.tre 12
mv clusters.out ${i}.bc_clusters_70.out
mv families.list ${i}.bc_families_70.out
grep -A 1 ' CLUSTER ' ${i}.bc_clusters_70.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\t'${i}'."NR }}' > ${i}.bc_clusters_70.csv
# 80%
perl ../../scripts/BranchClust_1.00.pl ../orthobench_trees/raw/${i}.tre 14
mv clusters.out ${i}.bc_clusters_80.out
mv families.list ${i}.bc_families_80.out
grep -A 1 ' CLUSTER ' ${i}.bc_clusters_80.out | grep -v " CLUSTER " | grep -v "\-\-" | awk 'NR == 1 { print "gene\torthogroup" } { for(i=1; i<=NF; i++) { print $i"\t'${i}'."NR }}' > ${i}.bc_clusters_80.csv
done
- Test BranchClust:
# within s02_evaluate_orthobench_master.R
From the orthobench-test
folder:
- OGs based on connected components starting from a focus species (human), using Orthobench trees:
for i in orthobench_trees/raw/*.tre ; do python ../scripts/phylome-ccs-focus-Hsap.py -i $i -p $(basename ${i%%.*}).ccsh -ogprefix "$(basename ${i%%.*})." ; done
for i in orthobench_trees/raw/*.tre ; do python ../scripts/phylome-ccs-focus-Dmel.py -i $i -p $(basename ${i%%.*}).ccsd -ogprefix "$(basename ${i%%.*})." ; done
# within s02_evaluate_orthobench_master.R
Deprecated alternatives:
- OGs based on the list of genes descending from a given duplication node (absurd):
# UNUSED
# for i in orthobench_trees/raw/*.tre ; do python ../scripts/phylome-dups.py -i $i -p $(basename ${i%%.*}).dups -ogprefix "$(basename ${i%%.*})." ; done
- OGs based on connected components, using PhylomeDB trees from a focus species (human) — this seems fair, but it's not a great idea: many reasons for sequence dropout may artifically reduce recall.
# DEPRECATED
# # get data from http://www.phylomedb.org/phylome_514 (meta human phylome with 66 species, including 15 species also present in Orthobench)
# cat ../phylome-data/phylome-0514-human/proteomes/*.fa > ../phylome-data/phylome-0514-human/proteomes_phylome.fasta
# diamond makedb --in ../phylome-data/phylome-0514-human/proteomes_phylome.fasta -d ../phylome-data/phylome-0514-human/proteomes_phylome.fasta
# # find matching sequences in phylomeDB
# mkdir -p results_phylomedb/
# for i in $(cut -f1 refOGs.csv | sort -u ) ; do
# diamond blastp --more-sensitive -d ../phylome-data/phylome-0514-human/proteomes_phylome.fasta -q results_searches/${i}.seed.fasta -o results_phylomedb/${i}.diamond_phylome.csv -p 4
# done
# # do it with all seqs
# diamond blastp -q ../phylome-data/phylome-0514-human/proteomes_phylome.fasta -d proteomes/all_proteomes.fa -o results_phylomedb/dictionary_diamond_phylome.csv -p 10
# awk '$3 == 100 { print $1,$2 }' results_phylomedb/dictionary_diamond_phylome.csv | tr ' ' '\t' > results_phylomedb/dictionary_diamond_phylome_filt.csv
# # get ortholog pairs from phylomeDB
# for i in $(cut -f1 refOGs.csv | sort -u ) ; do
# fgrep -f <(grep -w ${i} refOGs.csv | cut -f2) results_phylomedb/dictionary_diamond_phylome_filt.csv | grep "_HUMAN" | cut -f1 | sort -u > results_phylomedb/${i}.diamond_phylome.human.txt
# fgrep -f results_phylomedb/${i}.diamond_phylome.human.txt ../phylome-data/phylome-0514-human/orthologs.txt > results_phylomedb/${i}.diamond_phylome.human_orthologs.csv
# done
#
Setup of alternative methods.
- Download:
wget http://www.bioinformatics.org/branchclust/BranchClust_1-00.txt && mv BranchClust_1-00.txt ../scripts/BranchClust_1.00.pl
wget http://www.bioinformatics.org/branchclust/BranchClust_Tutorial.tgz && mv BranchClust_Tutorial.tgz ../scripts/
tar xvfz ../scripts/BranchClust_Tutorial.tgz && mv BranchClust_Tutorial/ ../scripts
wget http://bioinformatics.org/branchclust/BranchClust_all.tgz && tar xvfz BranchClust_all.tgz && mv BranchClust_all.tgz BranchClust_all ../scripts
We'll define orthogroups from a tree using naive approaches based on species overlap, and compare these orthogroups with those defined by Possvm.