/Clustering

Picks the best clustering % identity to use for a given organism/species, taking into account (in order of importance): (a) that higher % ids produce better alignments, so a minimum threshold of 70% is set; (b) that each cluster needs to be 'alignable', i.e. the max size of any given cluster is set to 100 seqs, unless the % id is very high (e.g. >= 90%), in which case the sequences are very similar, so the max size of any cluster can be raised to, say, 500 seqs and still produce a good alignment; (c) that we want to use the largest sample of sequences possible, i.e. the minimum number of unclustered seqs and singletons, provided the first two conditions are satisfied.
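The selection rule above can be sketched roughly as follows. This is only an illustration, not the repository's actual code: the `Candidate` type, its fields, the `PickBest` function, and the constant names are all hypothetical, and breaking ties in favour of the higher % id is an assumption based on point (a). It assumes each clustering run has already been summarised by its % identity, the size of its largest cluster, and its count of unclustered seqs plus singletons.

```go
package main

// Candidate summarises one clustering run at a given % identity.
// Hypothetical type for illustration; not taken from the repository.
type Candidate struct {
	PercentID      float64 // clustering threshold, e.g. 85 for 85%
	LargestCluster int     // number of seqs in the largest cluster
	Excluded       int     // unclustered seqs plus singletons
}

// Thresholds taken from the description above.
const (
	minPercentID         = 70.0 // (a) minimum % id
	highPercentID        = 90.0 // at or above this, larger clusters are allowed
	maxClusterSize       = 100  // (b) default max cluster size
	maxClusterSizeHighID = 500  // (b) max cluster size when % id >= 90
)

// alignable reports whether the largest cluster is within the size
// limit that applies at this candidate's % identity.
func alignable(c Candidate) bool {
	limit := maxClusterSize
	if c.PercentID >= highPercentID {
		limit = maxClusterSizeHighID
	}
	return c.LargestCluster <= limit
}

// PickBest returns the candidate that satisfies (a) and (b) and excludes
// the fewest sequences (c). Ties go to the higher % id, which is an
// assumption of this sketch. The bool is false if no candidate qualifies.
func PickBest(cands []Candidate) (Candidate, bool) {
	var best Candidate
	found := false
	for _, c := range cands {
		if c.PercentID < minPercentID || !alignable(c) {
			continue
		}
		better := c.Excluded < best.Excluded ||
			(c.Excluded == best.Excluded && c.PercentID > best.PercentID)
		if !found || better {
			best = c
			found = true
		}
	}
	return best, found
}
```

Conditions (a) and (b) act as hard filters here, and (c) is only used to rank the candidates that pass, which mirrors the stated order of importance.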

Primary language: Go