Database creation without orthofinder
Hello,
I was wondering whether it is possible to use SHOOT with an existing database of multiple sequence alignments and trees.
Let's say I reproduce the same directory layout as OrthoFinder and include the DIAMOND databases, the MSAs (FASTA) and the gene trees (Newick): would that be enough to get SHOOT working?
Thank you
It is possible if you provide the following directory structure and make some changes to the files:
ShootDB
├── Gene_Trees
├── MultipleSequenceAlignments
├── Orthogroup_Sequences
└── WorkingDirectory
    └── Alignments_ids
All your clusters need to be renamed using the OrthoFinder nomenclature scheme: "OG" followed by 7 digits, starting from 0 for the first cluster (OG0000000) and incrementing by one for each following cluster. The numbers must match the OrthoMCL output. This means that, if you use a different naming scheme, you will need a lookup table that maps your cluster names to OG names. You will also have to adjust those names in several places (file contents).
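The renaming described above can be sketched in a few lines of Python. This is only an illustration, not part of SHOOT or OrthoFinder: the function names `og_name` and `build_lookup` are hypothetical, and the sketch assumes your cluster names are listed in the same order as the clusters in the matrix file, so that the list position is the OG number.

```python
def og_name(index):
    """Return the OG name for the cluster at 0-based position `index`."""
    if not 0 <= index <= 9_999_999:
        raise ValueError(f"cluster index {index} does not fit in 7 digits")
    # "OG" followed by the index, zero-padded to 7 digits
    return f"OG{index:07d}"


def build_lookup(cluster_names):
    """Map each of your own cluster names to its OG name.

    Assumption: `cluster_names` is ordered like the clusters in the
    matrix file, so position 0 becomes OG0000000, and so on.
    """
    lookup = {}
    for index, name in enumerate(cluster_names):
        if name in lookup:
            raise ValueError(f"duplicate cluster name: {name}")
        lookup[name] = og_name(index)
    return lookup
```

The resulting dictionary can then be used to rename the tree, alignment and sequence files, and to rewrite any cluster names that appear inside file contents.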
- `Gene_Trees` should contain all your phylogenetic trees in Newick format (NHX is not supported). The naming scheme is "OG name" + "_tree.txt".
- `MultipleSequenceAlignments` should contain all the alignments in FASTA format. The naming scheme is "OG name" + ".fa".
- `Orthogroup_Sequences` should contain all the cluster sequences in FASTA format. The naming scheme is "OG name" + ".fa".
  Note: every sequence of a family must be present in the cluster FASTA, in its corresponding alignment and in its corresponding tree.
- `WorkingDirectory` should contain the following files:
  - `SpeciesIDs.txt`: lists the species FASTA files in your dataset, one per line, in the format `<species number (starting from 0)>: <species name>.faa`. Ex.: `0: arath.faa`, `1: orysa.faa`
  - `SequenceIDs.txt`: lists all the sequences in your dataset, one per line, in the format `<species number>_<species sequence number>: <sequence name>`. Ex.: `1_0: LOC_Os01g01050.1`
  - `Species<species number>.fa`: where "species number" corresponds to the species number given in `SpeciesIDs.txt`. It contains the species sequences in FASTA format. Sequence names should use the format `<species number>_<species sequence number>`, as described in `SequenceIDs.txt`.
  - `clusters_OrthoFinder_I1.5.txt_id_pairs.txt`: the OrthoMCL output matrix file ("out..I15" or ".I"). Cluster indices must correspond to OG numbers. Sequence names (listed for each cluster) must follow a different nomenclature: `<species index>` + underscore + `<species sequence number>`, as specified in `SequenceIDs.txt`. The matrix must contain exactly your clusters, no more, no less (i.e. if you discarded some clusters for your alignments and trees, they need to be removed from the matrix as well!).
  - `Alignments_ids`: contains the same files as `MultipleSequenceAlignments`, but with sequence names using the `SequenceIDs.txt` nomenclature (`<species number>_<species sequence number>`).
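As a rough illustration of the `SpeciesIDs.txt` and `SequenceIDs.txt` formats described above, the lines of both files could be generated like this. The function names are hypothetical, and the sketch assumes that sequence numbers start from 0 within each species, as the `1_0` example suggests.

```python
def species_ids_lines(species_files):
    """Lines of SpeciesIDs.txt: '<species number>: <species name>.faa'."""
    return [f"{number}: {name}" for number, name in enumerate(species_files)]


def sequence_ids_lines(sequences_by_species):
    """Lines of SequenceIDs.txt: '<species>_<sequence>: <sequence name>'.

    `sequences_by_species[i]` holds the sequence names of species i, in
    the order they should be numbered (assumed to start from 0).
    """
    lines = []
    for species, names in enumerate(sequences_by_species):
        for sequence, name in enumerate(names):
            lines.append(f"{species}_{sequence}: {name}")
    return lines


def name_to_id(sequences_by_species):
    """Map each original sequence name to its '<species>_<sequence>' ID."""
    mapping = {}
    for species, names in enumerate(sequences_by_species):
        for sequence, name in enumerate(names):
            mapping[name] = f"{species}_{sequence}"
    return mapping
```

The mapping returned by `name_to_id` is what you would apply when renaming sequences in the `Species<species number>.fa` files, in the matrix file and in `Alignments_ids`.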
Then, SHOOT can be used to initialize the "SHOOT database" with the following commands:
python shoot/create_shoot_db.py <your "ShootDB" path> full
python shoot/create_shoot_db.py <your "ShootDB" path> profiles
python shoot/bifurcating_trees.py <your "ShootDB" path>
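Before running the commands above, it may help to check that the directory layout is complete and that the same OG names appear everywhere. The following is a minimal sketch of such a check, not something SHOOT provides: `check_shoot_db` and `og_names_in` are hypothetical helpers, and the sketch only looks at file names (it does not open the files or check the `Species<species number>.fa` files).

```python
from pathlib import Path

REQUIRED_DIRS = (
    "Gene_Trees",
    "MultipleSequenceAlignments",
    "Orthogroup_Sequences",
    "WorkingDirectory",
    "WorkingDirectory/Alignments_ids",
)
REQUIRED_FILES = (
    "SpeciesIDs.txt",
    "SequenceIDs.txt",
    "clusters_OrthoFinder_I1.5.txt_id_pairs.txt",
)


def og_names_in(directory, suffix):
    """Return the OG names of files in `directory` named '<OG name><suffix>'."""
    return {p.name[: -len(suffix)] for p in Path(directory).glob("OG*" + suffix)}


def check_shoot_db(db_path):
    """Return a list of problems found in the layout (empty if none)."""
    db = Path(db_path)
    problems = [f"missing directory: {d}" for d in REQUIRED_DIRS
                if not (db / d).is_dir()]
    problems += [f"missing file: WorkingDirectory/{f}" for f in REQUIRED_FILES
                 if not (db / "WorkingDirectory" / f).is_file()]
    if problems:
        return problems

    # Every OG with a tree should also have an alignment, a sequence file
    # and an ID-renamed alignment, and vice versa.
    trees = og_names_in(db / "Gene_Trees", "_tree.txt")
    others = {
        "MultipleSequenceAlignments": ".fa",
        "Orthogroup_Sequences": ".fa",
        "WorkingDirectory/Alignments_ids": ".fa",
    }
    for label, suffix in others.items():
        names = og_names_in(db / label, suffix)
        for og in sorted(trees ^ names):
            problems.append(f"{og}: present in only one of Gene_Trees and {label}")
    return problems
```

Calling `check_shoot_db("<your ShootDB path>")` and printing the returned list gives a quick overview of anything that is missing or mismatched.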