How to select the neighbors genomes?
Opened this issue · 8 comments
Hi,
Thanks for developing this software.
I would like to ask how to select the neighbors genomes. Sometimes there are no sequences in the results of fur.
Thanks.
We are developing the package neighbors for that. The Tutorial in its documentation might help you complete a given set of targets and to find the corresponding neighbors.
We are developing the package neighbors for that. The Tutorial in its documentation might help you complete a given set of targets and to find the corresponding neighbors.
Thank you for your reply. Is this package currently available? I basically choose other genomes of the same genus as the neighbors, maybe I need to change the - q parameter.
Yes, that package is available. A common problem with fur is mixture between target and neighbor genomes. To make sure this doesn't happen, one should reconstruct a phylogeny from the genomes. If you'd like to see how this is done, take a look at the tutorial in the neighbors package.
Yes, that package is available. A common problem with fur is mixture between target and neighbor genomes. To make sure this doesn't happen, one should reconstruct a phylogeny from the genomes. If you'd like to see how this is done, take a look at the tutorial in the neighbors package.
Is there a way to automatically correct the target and neighbor genomes?
manual inspection is a time-consuming step and I have many species to construct these genomes. Looking forward to your reply. Thank you.
There's a semi-automatic way, which scales well to hundreds of bacterial genomes. It is described in the tutorial of the neighbors package. It consists of the following steps:
- Download target and neighbor genomes as listed by
neighbors
- Construct tree from genomes
- Label tree nodes with and inspect tree to pick the target clade
- List the leaves in the target clade and automatically create symbolic links for them to the
targets
directory - Repeat for the neighbors
- Run
makeFurDb
followed byfur
If this sounds promising to you, take a look at the neighbors tutorial for the details.
Thank you for your reply.
I have read the documentation for this package. Actually, I want to automate step 3. At present, I don't have a good idea.
It is a good question, here is a sketch of an answer: Let's say the leaves in the tree are labelled with prefix 'n' for taxonomic neighbors and prefix 't' for taxonomic targets, as explained in the neighbors tutorial. We are looking for the node, v, that maximizes the number of targets in the subtree rooted on v and the number of neighbors elsewhwere.
For example, the tree file o157.txt
contains 449 E. coli genomes, with the pathogen O157:H7 as the target. We iterate over the 448 internal nodes, which were labeled with lande
, and for each node calculate the node score consisting of the sum of the targets found in that node and the neighbors found outside:
for a in $(seq 448); do
t=$(pickle $a o157.txt | grep '^t' | wc -l)
n=$(pickle -c $a o157.txt | grep '^n' | wc -l)
((s=$t+$n))
echo $a $s
done > score.txt
We sort the scores to find that the desired node is 300, which is also the one I'd pick from visual inspection.
sort -n -k 2 -r score.txt | head
Hope this helps.