sanger-pathogens/Roary

Question about clustering the FASTA iteratively

bucongfan opened this issue · 1 comments

I have a question about the first core step: clustering the FASTA iteratively with cd-hit, which is also the key step in reducing the number of proteins.

Why do we need to cluster iteratively instead of clustering once directly at the expected threshold?

This question has always puzzled me, and I hope to get your reply.

Thanks!

It's because a single pass can overcluster when odd centroids are chosen, and we can use information we already know about the dataset to improve the results.

For example, imagine we have a gene that's 100% identical in every genome, and a similar gene that's 98% identical to it. Running cd-hit iteratively splits these into two clusters (all the genes that are 100% identical in all genomes go in one, the rest go in the other), which makes sense biologically. If you just ran cd-hit once with a 95% threshold, both genes would be clustered together and you would have to split the cluster manually later.
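
To make the example concrete, here is a rough toy sketch in Python. It is not Roary's or cd-hit's actual code: the function names (`greedy_cluster`, `iterative_cluster`), the threshold steps, the position-by-position identity measure, and the made-up sequences are all assumptions for illustration. It only shows the idea of setting aside clusters that already cover every genome before lowering the threshold.

```python
# Toy illustration only: NOT Roary's or cd-hit's real implementation.
# Assumptions: equal-length sequences, identity = fraction of matching
# positions (no alignment), and the first sequence seen becomes the centroid.

def meets_identity(a, b, threshold_pct):
    """True if a and b match at >= threshold_pct percent of positions."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches * 100 >= threshold_pct * len(a)

def greedy_cluster(seqs, threshold_pct):
    """Each (genome, sequence) joins the first centroid it is similar enough to."""
    clusters = []  # each cluster is a list; its first item is the centroid
    for item in seqs:
        for cluster in clusters:
            if meets_identity(cluster[0][1], item[1], threshold_pct):
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters

def iterative_cluster(seqs, genomes, thresholds_pct):
    """Cluster at decreasing thresholds, setting aside clusters found in every genome."""
    done, leftover, remaining = [], [], list(seqs)
    for t in thresholds_pct:
        leftover = []
        for cluster in greedy_cluster(remaining, t):
            if {g for g, _ in cluster} == genomes:
                done.append(cluster)      # present in all genomes: freeze it
            else:
                leftover.append(cluster)  # retry at the next, lower threshold
        remaining = [item for c in leftover for item in c]
    return done + leftover

def mutate(seq, positions):
    """Return seq with a different base substituted at each given position."""
    chars = list(seq)
    for p in positions:
        chars[p] = "A" if chars[p] != "A" else "C"
    return "".join(chars)

gene_a = "ACGT" * 25  # 100 bases, identical in every genome
seqs = [("g1", gene_a), ("g2", gene_a), ("g3", gene_a),
        # the similar gene: each copy is 98% identical to gene_a
        ("g1", mutate(gene_a, [10, 50])),
        ("g2", mutate(gene_a, [10, 60])),
        ("g3", mutate(gene_a, [10, 70]))]

single = greedy_cluster(seqs, 95)
iterative = iterative_cluster(seqs, {"g1", "g2", "g3"}, [100, 99, 98, 97, 96, 95])
print(len(single))     # 1: both genes merged into one cluster
print(len(iterative))  # 2: one cluster per gene
```

In this toy data the single 95% pass puts all six sequences under the first centroid, while the iterative version freezes the 100% identical gene at the first step, so the similar gene forms its own cluster once the threshold reaches 98%.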