davidemms/SHOOT

Database creation without orthofinder


Hello,

I was wondering whether it is possible to use SHOOT with an existing database of multiple sequence alignments and trees.
Let's say I reproduce the same directory layout as OrthoFinder and include the DIAMOND databases, the MSAs (FASTA) and the gene trees (Newick): would that be enough to get SHOOT working?

Thank you

It is possible, provided you set up the following directory structure and make some changes to the files:

ShootDB
    ├── Gene_Trees
    ├── MultipleSequenceAlignments
    ├── Orthogroup_Sequences
    └── WorkingDirectory
        └── Alignments_ids
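
The empty layout above can be created in one step. This is a minimal sketch using only the standard library; the function name `make_shoot_db_skeleton` and the constant `SHOOT_DB_SUBDIRS` are made up for illustration and are not part of SHOOT.

```python
from pathlib import Path

# Sub-directories from the tree above, relative to the database root.
# (Names taken from this thread; the constant itself is hypothetical.)
SHOOT_DB_SUBDIRS = (
    "Gene_Trees",
    "MultipleSequenceAlignments",
    "Orthogroup_Sequences",
    "WorkingDirectory/Alignments_ids",
)


def make_shoot_db_skeleton(root):
    """Create the empty directory layout under `root` and return the paths."""
    root = Path(root)
    created = []
    for sub in SHOOT_DB_SUBDIRS:
        path = root / sub
        # parents=True also creates WorkingDirectory itself.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_shoot_db_skeleton("ShootDB")` is safe to repeat, since existing directories are left untouched.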

All your clusters need to be renamed using the OrthoFinder nomenclature scheme: "OG" followed by 7 digits, starting at 0 for the first cluster (OG0000000) and incrementing by one for each subsequent cluster. The numbers must match the OrthoMCL output. This means that, if you use a different naming scheme, you will need a lookup table to map your cluster names to OG names. You will also have to adjust those names in several places (file contents).
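
The renaming and the lookup table can be sketched as follows. The helper names `og_name` and `build_lookup` are hypothetical, and the sketch assumes your clusters are passed in the same order as the rows of the cluster matrix, so that the index of a cluster equals its OG number.

```python
def og_name(index):
    """Return the OrthoFinder-style name for the cluster at `index` (0-based)."""
    if not 0 <= index <= 9_999_999:
        raise ValueError("cluster index must fit in 7 digits")
    return f"OG{index:07d}"


def build_lookup(cluster_names):
    """Map each original cluster name to an OG name, in the given order.

    Assumption: `cluster_names` is ordered like the cluster matrix rows.
    """
    lookup = {}
    for index, name in enumerate(cluster_names):
        if name in lookup:
            raise ValueError(f"duplicate cluster name: {name}")
        lookup[name] = og_name(index)
    return lookup


# Illustrative cluster names only.
print(build_lookup(["famA", "famB"]))
# → {'famA': 'OG0000000', 'famB': 'OG0000001'}
```

The same dictionary can then be used to rename the tree, alignment and sequence files and to translate any cluster names that appear inside file contents.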

  • Gene_Trees should contain all your phylogenetic trees in Newick format (NHX is not supported). The naming scheme is "OG name" + "_tree.txt".

  • MultipleSequenceAlignments should contain all the alignments in FASTA format. The naming scheme is "OG name" + ".fa".

  • Orthogroup_Sequences should contain all the cluster sequences in FASTA format. The naming scheme is "OG name" + ".fa".
    Note: every sequence of a family must be present in all three files: the cluster FASTA, its corresponding alignment and its corresponding tree.

  • WorkingDirectory should contain a set of files:

    • SpeciesIDs.txt: lists the species FASTA files in your dataset, one per line, in the format: <species number (starting from 0)>: <species name>.faa. Ex.: "0: arath.faa" on the first line and "1: orysa.faa" on the second.
    • SequenceIDs.txt: contains the list of all your dataset sequences, one per line, following the format: <species number>_<species sequence number>: <sequence name>. Ex.: 1_0: LOC_Os01g01050.1
    • Species<species number>.fa: where "species number" is the species number given in the SpeciesIDs.txt file. It contains that species' sequences in FASTA format. Sequence names should use the format <species number>_<species sequence number>, as described in SequenceIDs.txt.
    • clusters_OrthoFinder_I1.5.txt_id_pairs.txt: the OrthoMCL output matrix file ("out..I15" or ".I"). The cluster index must correspond to the OG numbers. Sequence names (listed for each cluster) must use the <species index>_<species sequence number> nomenclature specified in SequenceIDs.txt. The matrix must contain exactly your clusters, no more, no less (i.e. if you discarded some clusters for your alignments and trees, they must be removed from the matrix as well).
    • Alignments_ids: contains the same files as MultipleSequenceAlignments but with sequence names using the SequenceIDs.txt nomenclature (<species number>_<species sequence number>).
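
The ID files described in the list above can be derived from a single mapping of species to sequence names. The following is a minimal sketch, not SHOOT or OrthoFinder code: `build_id_tables` and `rename_fasta_headers` are hypothetical helpers, the Arabidopsis sequence name is illustrative, and the sketch assumes that the first whitespace-delimited token of each FASTA header is the sequence name.

```python
def build_id_tables(species_to_sequences):
    """Build the lines of SpeciesIDs.txt and SequenceIDs.txt.

    `species_to_sequences` maps a species FASTA file name to the ordered
    list of its sequence names. Returns (species_lines, sequence_lines,
    name_to_id), where name_to_id maps a sequence name to "<sp>_<seq>".
    """
    species_lines, sequence_lines, name_to_id = [], [], {}
    for sp_index, (species_file, seq_names) in enumerate(species_to_sequences.items()):
        species_lines.append(f"{sp_index}: {species_file}")
        for seq_index, seq_name in enumerate(seq_names):
            seq_id = f"{sp_index}_{seq_index}"
            sequence_lines.append(f"{seq_id}: {seq_name}")
            name_to_id[seq_name] = seq_id
    return species_lines, sequence_lines, name_to_id


def rename_fasta_headers(fasta_text, name_to_id):
    """Replace each '>name ...' header with '>id'; sequence lines are kept.

    Assumption: the sequence name is the first token after '>'.
    Raises KeyError for a name that is absent from `name_to_id`.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split(maxsplit=1)
            if not tokens:
                raise ValueError("FASTA header without a sequence name")
            out.append(">" + name_to_id[tokens[0]])
        else:
            out.append(line)
    return "\n".join(out) + "\n"


species, sequences, ids = build_id_tables({
    "arath.faa": ["AT1G01010.1"],        # illustrative name
    "orysa.faa": ["LOC_Os01g01050.1"],
})
print(sequences)
# → ['0_0: AT1G01010.1', '1_0: LOC_Os01g01050.1']
```

`rename_fasta_headers` can be applied to a species FASTA to produce Species<species number>.fa, and to each alignment in MultipleSequenceAlignments to produce its counterpart in Alignments_ids.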

Then, SHOOT can be used to initialize the "SHOOT database" with the following commands:

python shoot/create_shoot_db.py <your "ShootDB" path> full
python shoot/create_shoot_db.py <your "ShootDB" path> profiles
python shoot/bifurcating_trees.py <your "ShootDB" path>
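
Before running these commands, it may help to check that every cluster has its four files and that the file names follow the OG scheme. The sketch below is an assumption-laden sanity check based only on the naming rules given in this thread; `check_shoot_db_layout` is a hypothetical helper, it does not inspect file contents or the WorkingDirectory ID files, and it is not part of SHOOT.

```python
import re
from pathlib import Path

OG_PATTERN = re.compile(r"OG\d{7}")

# Directory (relative to the database root) -> suffix that follows the OG name.
EXPECTED = {
    "Gene_Trees": "_tree.txt",
    "MultipleSequenceAlignments": ".fa",
    "Orthogroup_Sequences": ".fa",
    "WorkingDirectory/Alignments_ids": ".fa",
}


def check_shoot_db_layout(root):
    """Return a list of problems found under `root` (empty if none)."""
    root = Path(root)
    problems = []
    found = {}
    for subdir, suffix in EXPECTED.items():
        directory = root / subdir
        found[subdir] = set()
        if not directory.is_dir():
            problems.append(f"missing directory: {subdir}")
            continue
        for path in directory.iterdir():
            if not path.name.endswith(suffix):
                continue  # files with another suffix are ignored here
            og = path.name[: -len(suffix)]
            if OG_PATTERN.fullmatch(og):
                found[subdir].add(og)
            else:
                problems.append(f"unexpected file name: {subdir}/{path.name}")
    # Every OG seen anywhere must be present in all four directories.
    all_ogs = set().union(*found.values())
    for og in sorted(all_ogs):
        for subdir, names in found.items():
            if og not in names:
                problems.append(f"{og} has no file in {subdir}")
    return problems
```

An empty list from `check_shoot_db_layout("ShootDB")` only means that the file names are consistent; it does not guarantee that the sequences inside the files match.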