consistency of clusters generated with different parameters (affinityprop to self)
Hello,

As mentioned earlier, I have noted differences between the clusters generated by apcluster and those computed with affinityprop. I wanted to check how much the clusters differ when I run the same program with different sets of parameters. Since affinityprop is so much faster, I started with it to get an idea.

Tested settings (ongoing):
- convergence_iter: 100, 400, 600
- damping: 0.50, 0.75, 0.80, 0.90
- max_iter: 1000, 2000, 4000, 5000, 7500, 10000
- remaining params, precision & preference: default
- input: a fairly large 21k similarity matrix
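For anyone wanting to reproduce a comparable sweep without the affinityprop binary, here is a minimal sketch using scikit-learn's AffinityPropagation, which exposes the same damping, max_iter, and convergence_iter knobs; the toy similarity matrix below is made up and merely stands in for the real 21k x 21k input:

import numpy as np
from sklearn.cluster import AffinityPropagation

# toy precomputed similarity matrix (stand-in for the real 21k x 21k input)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # negative squared distances

# one clustering per damping value, holding the other knobs fixed
labelings = {}
for damping in (0.5, 0.75, 0.9):
    ap = AffinityPropagation(affinity="precomputed", damping=damping,
                             max_iter=1000, convergence_iter=100,
                             random_state=0)
    labelings[damping] = ap.fit(S).labels_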
Jobs for max_iter >= 5000 are still running, but I can already report a few findings:
- increasing convergence_iter and/or max_iter changes nothing or very little:

  # calculated with the Python script below
  Adjusted Rand Score: affprop_21k_damp0.5_conv100_maxit1000.out vs affprop_21k_damp0.5_conv100_maxit5000.out 1.0
- changing damping does influence the results, but not by much:

  Adjusted Rand Score: affprop_21k_damp0.5_conv400_maxit4000.out vs affprop_21k_damp0.75_conv400_maxit4000.out 0.904519555779339
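As a side note, adjusted_rand_score is the right metric here because it is invariant to the arbitrary cluster IDs each run assigns; a quick self-contained illustration:

from sklearn.metrics.cluster import adjusted_rand_score

a = [0, 0, 1, 1, 2, 2]
b = [5, 5, 3, 3, 9, 9]  # same partition as a, different cluster IDs
c = [0, 1, 0, 1, 2, 2]  # a genuinely different partition

print(adjusted_rand_score(a, b))  # 1.0 -- identical clusterings
print(adjusted_rand_score(a, c))  # well below 1.0 -- partial agreement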
In summary:
- I am really happy with the consistency of the results calculated by affinityprop.
- Tuning the parameters to make the results more apcluster-like seems futile.
Hope it helps.

Darek Kedra

PS I will submit some comparisons of apcluster vs affinityprop later, hopefully today.
#!/usr/bin/env python
"""
Compare two affinityprop outputs.

usage:
    ./clust_dist.py affprop_result1.out affprop_result2.out

tested with:
    python       3.10.2  h62f1059_0_cpython  conda-forge
    scikit-learn 1.0.2   py310h1246948_0     conda-forge
"""
import sys

from sklearn.metrics.cluster import adjusted_rand_score


def get_sample_clustid_list(fasta_fn):
    """Parse a fasta-like affinityprop output into a list of cluster IDs,
    ordered by sample ID, suitable for adjusted_rand_score()."""
    sample_clustid_dict = {}
    elem_clustid_list = []
    with open(fasta_fn, "r") as fh:
        next(fh)  # skip the first (header) line
        cluster_id = None
        for line in fh:
            if line.startswith(">"):
                # header lines look like ">...=<cluster_id>"
                sl = line.split()
                cluster_id = int(sl[0].split("=")[1])
            else:
                # member lines are comma-separated integer sample IDs
                clust_members = [int(x) for x in line.rstrip().split(",")]
                for sample in clust_members:
                    sample_clustid_dict[sample] = cluster_id
    # emit cluster IDs in a fixed (sorted-by-sample) order so that two
    # files covering the same samples yield comparable label lists
    for key in sorted(sample_clustid_dict):
        elem_clustid_list.append(sample_clustid_dict[key])
    return elem_clustid_list


if __name__ == "__main__":
    clusters_1_fn = sys.argv[1]
    clusters_2_fn = sys.argv[2]
    clusters_1_list = get_sample_clustid_list(clusters_1_fn)
    clusters_2_list = get_sample_clustid_list(clusters_2_fn)
    rand_score = adjusted_rand_score(clusters_1_list, clusters_2_list)
    print(f"Adjusted Rand Score: {clusters_1_fn} vs {clusters_2_fn} {rand_score}")
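For completeness, this is the input format the parser above assumes, reconstructed from the parsing logic itself (the Cluster= token name is illustrative; only the =<id> suffix of the first token and the comma-separated member lines matter). Writing the same toy content to two files and running clust_dist.py on them should print an ARI of 1.0:

# toy file in the format get_sample_clustid_list() expects: one skipped
# header line, then ">...=<cluster_id>" lines, each followed by a
# comma-separated list of integer sample IDs
toy = """header line (skipped by the parser)
>Cluster=0
3,7,11
>Cluster=1
1,2
"""
for fn in ("toy1.out", "toy2.out"):
    with open(fn, "w") as fh:
        fh.write(toy)

# ./clust_dist.py toy1.out toy2.out
# Adjusted Rand Score: toy1.out vs toy2.out 1.0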
Hi Darek,
Thank you for taking the time to do these tests! I am gathering more tests/use cases, so I greatly appreciate the extra information.
Chris
Closing this issue for now - F1 and ARI calculations are listed in documentation/README. Thank you again for running this analysis!