scikit-tda/kepler-mapper

Filter out nodes in graph based on number of elements

holmbuar opened this issue · 18 comments

Is your feature request related to a problem? Please describe.
I have a point cloud with approximately 2000 data points. The clustering is based on a precomputed distance matrix. The graph produced by Mapper has a lot of nodes with between 1 and 5 elements, cluttering the visualization of the graph/complex.

Describe the solution you'd like
I would like to specify a minimum number of elements in a node of the graph for visualization.

Describe alternatives you've considered
Filter out minor nodes from the dictionary produced by mapper.map with a dict comprehension. I have looked at the scikit-learn docs for a way to filter out clusters with a small number of elements, but I have not found such parameters when using a precomputed metric. Increasing distance_threshold helps up to a certain extent.
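For concreteness, a post-hoc filter along those lines could look like the sketch below. It assumes the dict returned by mapper.map exposes "nodes" (node id → member indices) and "links" (node id → linked node ids); the min_node_size threshold is illustrative.

def filter_small_nodes(graph, min_node_size=5):
    """Drop nodes with fewer than min_node_size members and prune links
    that point to a dropped node. Sketch only -- assumes the graph dict
    has "nodes" and "links" keys as described above."""
    kept = {
        node_id: members
        for node_id, members in graph["nodes"].items()
        if len(members) >= min_node_size
    }
    links = {
        node_id: [target for target in targets if target in kept]
        for node_id, targets in graph.get("links", {}).items()
        if node_id in kept
    }
    filtered = dict(graph)
    filtered["nodes"] = kept
    filtered["links"] = links
    return filtered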

Additional context
I would like to discuss how to remove nodes from graph without affecting the topology of the dataset.

For the record, I tried manipulating the

if precomputed:
    min_cluster_samples = 2

condition in kmapper.py. To my understanding, this defines the minimal number of samples in each cube of the covering. This certainly works as a reduction method, but it is difficult to know a priori what the number of samples should be. Pushing the minimum number of samples higher results in different topologies.

I guess it is safer to manipulate the graph after both clustering and Mapper have run?

It looks like you were the one who authored the commit that added the fixed min_cluster_samples = 2 for precomputed? Why was fixing that value necessary? I haven't used kepler-mapper recently, but I was using a precomputed matrix before without a fixed number. I wrote the initial precomputed flag logic, which pulls (used to pull, before your commit) min_cluster_samples from the clusterer, as documented here:

precomputed : Boolean
Tell Mapper whether the data that you are clustering on is a precomputed distance matrix. If set to
`True`, the assumption is that you are also telling your `clusterer` that `metric='precomputed'` (which
is an argument for DBSCAN among others), which will then cause the clusterer to expect a square
distance matrix for each hypercube. `precomputed=True` will give a square matrix to the clusterer
to fit on for each hypercube.

If I loosely understand your commit message 2ccce2d, then it's possible that the clusterer does not have n_clusters set. The logic here:

min_cluster_samples = cluster_params.get(
    "n_clusters",
    cluster_params.get(
        "min_cluster_size", cluster_params.get("min_samples", 1)
    ),
)

is intended to first try n_clusters -- failing that, try min_cluster_size -- failing that, try min_samples. What were you running where the clusterer had none of those three set, which was leading to a crash?

Yes, I issued the pull request you are mentioning. I seem to recall that I got a scikit-learn crash if affinity='precomputed' and n_clusters was not None. Does that make sense?

I cannot justify the fixed number 2; it probably comes from the default n_clusters in the scikit-learn docs.

I must admit I did not notice the logic you are referring to; I was naively focused on fixing the specific error mentioned in the pull request. Sorry!

Want to issue a new PR that still works for your use case (the one that was breaking) and re-adopts that more flexible logic?

I certainly can if I am able to get the logic :)

  1. By using affinity=precomputed, doesn't Scikit require distance_threshold to be set?
  2. How "safe" is it to manipulate min_cluster_samples a priori, compared to adjusting after Mapper has returned the complex?

By using affinity=precomputed, doesn't Scikit require distance_threshold to be set?

Which scikit-learn estimator are you referring to? DBSCAN doesn't -- I have used DBSCAN like this: clusterer=km.cluster.DBSCAN(eps=.5, min_samples=4, metric='precomputed')
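For context, a minimal end-to-end sketch of that kind of call is below. The toy point cloud, lens choice, and DBSCAN parameters are all illustrative, and it assumes .map() accepts precomputed=True as described in the docstring quoted earlier (DBSCAN is imported from scikit-learn directly here).

import numpy as np
import kmapper as km
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

# Toy stand-in for the point cloud discussed above
point_cloud = np.random.rand(200, 3)
X_dist = squareform(pdist(point_cloud))     # square precomputed distance matrix
lens = X_dist.sum(axis=1).reshape(-1, 1)    # simple eccentricity-like lens, for illustration

mapper = km.KeplerMapper(verbose=1)
graph = mapper.map(
    lens,
    X_dist,
    clusterer=DBSCAN(eps=0.5, min_samples=4, metric="precomputed"),
    precomputed=True,  # per the docstring quoted above
)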

How "safe" is it to manipulate min_cluster_samples a priori, compared to adjusting after Mapper has returned the complex?

That would be a tad hairy to manipulate afterwards because the complex could include links to clusters which might be dropped during adjustment, but in my naive opinion it's fine to do either before or after.

Ah, sorry for not specifying: I have used sklearn.cluster.AgglomerativeClustering with single linkage. I am trying to reproduce the work of Carlsson and Gabrielsson.

Here is my full clustering section:

import numpy as np
from scipy import spatial
from sklearn.cluster import AgglomerativeClustering as cl

# normalize each column by its variance before computing distances
point_cloud_column_variance = np.var(point_cloud, axis=0)
point_cloud_normalized = point_cloud / point_cloud_column_variance

# build the square (redundant) distance matrix from the condensed form
condensed_distance_matrix = spatial.distance.pdist(point_cloud_normalized)
redundant_distance_matrix = spatial.distance.squareform(condensed_distance_matrix)

model_vne = cl(n_clusters=None,
               affinity='precomputed',
               compute_full_tree=True,
               linkage='single',
               distance_threshold=15.0)
model_vne.fit(redundant_distance_matrix)

EDIT: added clustering import statement

I re-read your earlier comment:

To my understanding, this defines the minimal number of samples in each cube of the covering. This certainly works as a reduction method, but it is difficult to know a priori what the number of samples should be. Pushing the minimum number of samples higher results in different topologies.

It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't have min_samples, it is skipped -- it is not extended until it has min_samples within it. The logic is that if the hypercube doesn't have min_samples, then the clusterer run on that hypercube would definitely not find any cluster of points within that hypercube with at least min_samples.

Because it skips the window rather than expanding it, there shouldn't be any difference in Mapper output whether clusters are filtered out before or after the graph is created.

... except, hmm, in your case, where min_samples isn't used, I see that AgglomerativeClustering breaks when not fed at least two samples: https://github.com/scikit-learn/scikit-learn/blob/e5698bde9a8b719514bf39e6e5d58f90cfe5bc01/sklearn/cluster/_agglomerative.py#L796 I forgot to mention the "otherwise" condition for this logic:

min_cluster_samples = cluster_params.get(
    "n_clusters",
    cluster_params.get(
        "min_cluster_size", cluster_params.get("min_samples", 1)
    ),
)

... failing the first three attempts, a min_cluster_samples value of 1 is used.

I wonder (1) whether some people would be interested in retaining one-element clusters (I bet the answer is "yes"), and if so, (2) how best to tell whether a clusterer that requires at least two samples is being used.

Perhaps the default value of n_clusters for the constructor could be used as a fallback. For AgglomerativeClustering, that's 2. A constructor's signature can be inspected to get its default argument values.
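As a sketch of that lookup (assuming Python's standard inspect module; the default_arg helper is made up for illustration):

import inspect
from sklearn.cluster import AgglomerativeClustering

def default_arg(callable_obj, name):
    # Return the default value of a named constructor/function parameter,
    # or None if the parameter has no default.
    param = inspect.signature(callable_obj).parameters.get(name)
    if param is None or param.default is inspect.Parameter.empty:
        return None
    return param.default

print(default_arg(AgglomerativeClustering, "n_clusters"))  # 2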

Or maybe, if cluster_params has distance_threshold set, then default to a minimum of 2. I don't know how well that would generalize across scikit-learn clusterers, though.
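In code, that idea might look roughly like this (a hypothetical fragment, not current kmapper code; cluster_params and min_cluster_samples are the names from the logic quoted earlier):

# Hypothetical extra fallback: a configured distance_threshold (as with
# AgglomerativeClustering) implies the clusterer needs at least 2 samples per fit.
if cluster_params.get("distance_threshold") is not None:
    min_cluster_samples = max(min_cluster_samples, 2)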

@sauln what do you think about the following: allow min_cluster_samples to be passed to .map(), which would

  • be preferred over any cluster_params in determining whether or not to skip a hypercube
  • be manually checked against node size before line 554,

    for pred in np.unique(cluster_predictions):
        # if not predicted as noise
        if pred != -1 and not np.isnan(pred):
            cluster_id = "cube{}_cluster{}".format(i, int(pred))
            nodes[cluster_id] = hypercube[:, 0][cluster_predictions == pred].astype(int).tolist()

    to see if minimum cluster samples are present before adding a cluster as a node (sketched below).

I don't see an easier way to handle clusterers such as AgglomerativeClustering which, in @torlarse 's case, have to have at least two samples per clustering attempt, but don't necessarily have any of n_clusters, min_cluster_size, or min_samples set. My above idea of checking default values for n_clusters doesn't generalize well to clusterers such as Birch, and otherwise there is no reliable kmapper way to specify a minimum number of samples per cluster.
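A rough sketch of what that check could look like inside the node-building loop (hypothetical min_cluster_samples argument; cluster_predictions, hypercube, i, and nodes are the variables from the kmapper.py snippet quoted above, not new API):

for pred in np.unique(cluster_predictions):
    # if not predicted as noise
    if pred != -1 and not np.isnan(pred):
        members = hypercube[:, 0][cluster_predictions == pred].astype(int).tolist()
        # proposed addition: drop clusters smaller than the requested minimum
        if len(members) < min_cluster_samples:
            continue
        cluster_id = "cube{}_cluster{}".format(i, int(pred))
        nodes[cluster_id] = members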

It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't have min_samples, it is skipped -- it is not extended until it has min_samples within it.

Yes, thanks, I hope I have the right understanding of hypercubes; pardon my imprecise English. Setting min_cluster_samples=20 on my data caused Mapper to return a "broken", non-circular topology. It makes sense, since by inspection I can see that the nodes around the holes have between 10 and 15 elements.

Anyways, I made things work with a combination of setting min_cluster_samples=10 in kmapper.py and distance_threshold=15 in the clustering function call. I don't know if this will be a problem experienced by others.

@sauln I'm going to make the final fallback a minimum of 2, not 1, and take out the hard-coded 2 for precomputed matrices.
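Presumably that change would look something like the following (an assumption about the eventual edit, not the merged diff):

# Same fallback chain as before, but ending at 2 instead of 1; the hard-coded
# `min_cluster_samples = 2` branch for precomputed matrices is removed.
min_cluster_samples = cluster_params.get(
    "n_clusters",
    cluster_params.get(
        "min_cluster_size", cluster_params.get("min_samples", 2)
    ),
)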

sauln commented

Seems reasonable to me 👍

I just found this again because I have @torlarse 's original problem -- I want to filter out nodes that are too small for my liking, and min_samples doesn't do that.

@sauln I'm going to add an argument to .map that does what was originally asked -- specify a min number of points that a cluster must have in order to be retained as a node. "min_node_samples"?

I'll also add a filter to the vis, because hey why not