eead-csic-compbio/get_homologues

Pangenome question


Hello,

I have used get_homologues and anvi'o to predict the pangenomes of four different species. The anvi'o pangenome sizes were consistently smaller, and I was wondering why that might be. How does get_homologues determine gene clusters? Is it perhaps less stringent? I followed the method in "4.9.1 Obtaining a pangenome matrix", using OMCL and COGS, to obtain the get_homologues pangenomes.

GET_HOMOLOGUES pangenome: Cpr 3777, Cps 4978, Cac 4488, Ctu 3232

Anvi'o pangenome: Cpr 3108, Cps 3590, Cac 3427, Ctu 2907

Percent difference (relative to the mean of the two sizes):
19.4 Cpr
32.4 Cps
26.8 Cac
10.6 Ctu
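
For reference, the figures above are consistent with the symmetric percent difference, that is, the absolute difference divided by the mean of the two pangenome sizes. A minimal sketch that reproduces them from the counts listed above (the function and variable names are only illustrative):

```python
def percent_difference(a, b):
    """Symmetric percent difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


# Pangenome sizes as reported in this post
get_homologues_sizes = {"Cpr": 3777, "Cps": 4978, "Cac": 4488, "Ctu": 3232}
anvio_sizes = {"Cpr": 3108, "Cps": 3590, "Cac": 3427, "Ctu": 2907}

differences = {
    species: round(percent_difference(get_homologues_sizes[species], anvio_sizes[species]), 1)
    for species in get_homologues_sizes
}

for species, value in differences.items():
    print(f"{value} {species}")
```

Dividing by the GET_HOMOLOGUES size or by the anvi'o size instead would give somewhat different percentages, so it is worth stating which denominator is used when comparing tools.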

Hi @TommyH-Tran , if you are using default params, that would mean

-C min %coverage in BLAST pairwise alignments                  (range [1-100],default=75)
-S min %sequence identity in BLAST query/subj pairs            (range [1-100],default=1 [BDBH|OMCL])
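
As a rough illustration of what these two thresholds do, here is a minimal sketch that filters pairwise hits by coverage and identity. This is not GET_HOMOLOGUES code: the `Hit` class, the example hits, and the assumption that coverage is already available as a single percentage per hit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    identity: float  # % identical positions in the alignment
    coverage: float  # % of the sequence covered by the alignment (assumed precomputed)


def passes(hit, min_coverage=75.0, min_identity=1.0):
    """Keep a hit only if it meets both thresholds (defaults mirror -C 75 and -S 1)."""
    return hit.coverage >= min_coverage and hit.identity >= min_identity


# Hypothetical hits, for illustration only
hits = [
    Hit("a1", "b1", identity=35.0, coverage=90.0),
    Hit("a2", "b2", identity=80.0, coverage=40.0),  # high identity, low coverage
    Hit("a3", "b3", identity=28.0, coverage=76.0),  # low identity, enough coverage
]

kept_default = [h.query for h in hits if passes(h)]
kept_stricter = [h.query for h in hits if passes(h, min_identity=30.0)]

print(kept_default)
print(kept_stricter)
```

With an identity cutoff as low as 1, coverage is the threshold that does most of the filtering in this sketch; raising the identity cutoff removes the divergent pair as well.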

Those defaults have worked well in our experience across bacterial groups in general, and the fact that you are using the OMCL-COGS intersection should give you more confidence in your set of clusters. Some ideas:

  1. Check a few clusters private to GET_HOMOLOGUES
  2. Use -D to ensure all sequences in a cluster share the same Pfam domains; this will increase stringency
  3. In the original paper (https://journals.asm.org/doi/full/10.1128/aem.02411-13) we already reported that GET_HOMOLOGUES was able to capture many orthogroups missed by OMA
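
For the first idea, one possible way to find clusters private to one tool is to compare cluster memberships as sets of gene identifiers. The sketch below assumes both sets of clusters have already been parsed into lists of gene IDs. The IDs and the `private_clusters` helper are hypothetical, and parsing the actual output files of either tool is not shown.

```python
def private_clusters(clusters_a, clusters_b):
    """Return clusters of A whose exact membership does not occur in B."""
    b_memberships = {frozenset(cluster) for cluster in clusters_b}
    return [sorted(cluster) for cluster in clusters_a
            if frozenset(cluster) not in b_memberships]


# Hypothetical gene IDs, for illustration only
get_homologues_clusters = [["g1", "g2"], ["g3"], ["g4", "g5"]]
anvio_clusters = [["g1", "g2", "g3"], ["g4", "g5"]]

private = private_clusters(get_homologues_clusters, anvio_clusters)
print(private)
```

In this toy case the other tool merges g3 into a larger cluster, so the first tool reports one more cluster for the same genes. Inspecting a few such cases by hand should show whether the extra clusters are plausible splits or artifacts.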

Hope this helps,
Bruno