cjneely10/affinityprop

consistency of clusters generated with different parameters (affinityprop to self)

Closed this issue · 2 comments

Hello,

As mentioned earlier, I have noted differences between clusters generated by apcluster vs. those computed with affinityprop. I wanted to check how much the clusters differ when I run the same program with different sets of parameters. Since affinityprop is so much faster, I started with it to get an idea. Tested settings (ongoing):

  • convergence_iter: 100, 400, 600
  • damping: 0.50, 0.75, 0.80, 0.90
  • max_iter: 1000, 2000, 4000, 5000, 7500, 10000
  • remaining params, precision & preference: default
  • input: a fairly large similarity matrix (~21k samples).
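The grid above can be enumerated programmatically, which also reproduces the output-file naming used in the results further down. A minimal sketch (pure stdlib; the actual affinityprop invocation is omitted since the real CLI flags should be taken from its help output):

```python
from itertools import product

# Parameter grid from the list above; file names follow the
# affprop_21k_damp{d}_conv{c}_maxit{m}.out pattern used in the results.
dampings = [0.50, 0.75, 0.80, 0.90]
convergence_iters = [100, 400, 600]
max_iters = [1000, 2000, 4000, 5000, 7500, 10000]

combos = list(product(dampings, convergence_iters, max_iters))
print(len(combos))  # 4 * 3 * 6 = 72 runs

out_files = [
    f"affprop_21k_damp{damp}_conv{conv}_maxit{maxit}.out"
    for damp, conv, maxit in combos
]
print(out_files[0])  # affprop_21k_damp0.5_conv100_maxit1000.out
```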

Jobs for max_iter >= 5000 are still running, but I can already report a few findings:

  • increasing convergence_iter and/or max_iter changes nothing, or very little:
# calculated with the Python script below
Adjusted Rand Score: affprop_21k_damp0.5_conv100_maxit1000.out vs affprop_21k_damp0.5_conv100_maxit5000.out 1.0
  • changing damping does influence the results, but not by much:
Adjusted Rand Score: affprop_21k_damp0.5_conv400_maxit4000.out vs affprop_21k_damp0.75_conv400_maxit4000.out 0.904519555779339
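For context, the adjusted Rand index compares two partitions regardless of the actual label values, so two runs that group the samples identically score exactly 1.0 even if their cluster numbering differs. A quick sketch:

```python
from sklearn.metrics.cluster import adjusted_rand_score

# Identical groupings with different label values still score exactly 1.0:
a = [0, 0, 1, 1, 2, 2]
b = [5, 5, 3, 3, 9, 9]  # same partition, relabeled
print(adjusted_rand_score(a, b))  # 1.0

# Moving one sample to another cluster drops the score below 1.0:
c = [0, 0, 1, 1, 1, 2]
print(adjusted_rand_score(a, c))
```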

In summary:

  • I am really happy with the consistency of the results calculated by affinityprop.
  • tuning the parameters to make the results more apcluster-like seems futile.

Hope it helps.

Darek Kedra
PS: I will submit some apcluster vs affinityprop comparisons later, hopefully today.

#!/usr/bin/env python
"""
Compare two affinityprop cluster outputs using the adjusted Rand index.

usage:

./clust_dist.py affprop_result1.out affprop_result2.out

tested with:
python                    3.10.2          h62f1059_0_cpython    conda-forge
scikit-learn              1.0.2           py310h1246948_0    conda-forge

"""

import sys
from sklearn.metrics.cluster import adjusted_rand_score

def get_sample_clustid_list(fasta_fn):
    """Parse an affinityprop output file; return cluster IDs ordered by sample index."""
    sample_clustid_dict = {}

    with open(fasta_fn, "r") as fh:
        next(fh)  # skip the first (header) line of the affinityprop output
        for line in fh:
            if line.startswith(">"):
                # cluster header line, e.g. ">...=3" -- the ID follows the "="
                sl = line.split()
                cluster_id = int(sl[0].split("=")[1])
            else:
                # comma-separated sample indices belonging to the current cluster
                clust_members = [int(x) for x in line.rstrip().split(",")]
                for sample in clust_members:
                    sample_clustid_dict[sample] = cluster_id

    # sort by sample index so that the lists from two files align position-wise
    return [sample_clustid_dict[key] for key in sorted(sample_clustid_dict)]

if __name__ == "__main__":
    clusters_1_fn = sys.argv[1]
    clusters_2_fn = sys.argv[2]
    
    clusters_1_list = get_sample_clustid_list(clusters_1_fn)
    clusters_2_list = get_sample_clustid_list(clusters_2_fn)
    
    rand_score = adjusted_rand_score(clusters_1_list, clusters_2_list)
    print(f"Adjusted Rand Score: {clusters_1_fn} vs {clusters_2_fn} {rand_score}")
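To sanity-check the parsing logic, one can feed it a tiny mock file in the format the script assumes: one header line (skipped), then `>Name=<id>` cluster headers, each followed by a comma-separated line of member indices. The header text and cluster-header wording below are made up for illustration, and the parsing is inlined so the snippet is self-contained:

```python
import os
import tempfile

# Mock file in the assumed format; only the header line's presence matters.
mock = "header line (skipped)\n>Cluster=0\n2,0\n>Cluster=1\n1,3\n"

with tempfile.NamedTemporaryFile("w", suffix=".out", delete=False) as tmp:
    tmp.write(mock)
    path = tmp.name

# Inline version of get_sample_clustid_list for a self-contained demo:
clustid = {}
with open(path) as fh:
    next(fh)  # skip the header line
    for line in fh:
        if line.startswith(">"):
            cluster_id = int(line.split()[0].split("=")[1])
        else:
            for sample in line.rstrip().split(","):
                clustid[int(sample)] = cluster_id
os.remove(path)

labels = [clustid[k] for k in sorted(clustid)]
print(labels)  # samples 0..3 ordered by index: [0, 1, 0, 1]
```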
 

Hi Darek,

Thank you for taking the time to do these tests! I am gathering more tests/use cases, so I greatly appreciate the extra information.

Chris

Closing this issue for now - F1 and ARI calculations are listed in the documentation/README. Thank you again for running this analysis!