Filter out nodes in graph based on number of elements
holmbuar opened this issue · 18 comments
Is your feature request related to a problem? Please describe.
I have a point cloud with approximately 2000 data points. The clustering is based on a precomputed distance matrix. The graph produced by Mapper has a lot of nodes with between 1 and 5 elements, cluttering the visualization of the graph/complex.
Describe the solution you'd like
I would like to specify a minimum number of elements in a node of the graph for visualization.
Describe alternatives you've considered
Filter out minor nodes from the dictionary produced by `mapper.map` with a dict comprehension. I have looked at the scikit-learn docs for a way to filter out clusters with a small number of elements, but I have not found such parameters when using a precomputed metric. Increasing `distance_threshold` helps up to a certain extent.
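A minimal sketch of that dict comprehension, assuming the `"nodes"` layout of the dict returned by `mapper.map` (the `min_size` cutoff is hypothetical):

```python
# graph is the dict returned by mapper.map(); graph["nodes"] maps a
# node id to the list of member indices clustered into that node.
min_size = 5
graph["nodes"] = {
    node_id: members
    for node_id, members in graph["nodes"].items()
    if len(members) >= min_size
}
```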
Additional context
I would like to discuss how to remove nodes from the graph without affecting the topology of the dataset.
For the record, I tried manipulating the `if precomputed: min_cluster_samples = 2` condition in `kmapper.py`. To my understanding this defines the minimal number of samples in each cube of the covering. It certainly works as a reduction method, but it is difficult to know a priori what the number of samples should be, and pushing the minimum samples up results in different topologies.
I guess it is safer to manipulate the graph after both clustering and mapper?
It looks like you were the one who authored the commit that added the fixed `min_cluster_samples = 2` for precomputed? Why was fixing that necessary? I haven't used kepler-mapper recently, but I was using a precomputed matrix before without a fixed number. I wrote the initial precomputed logic flag, which pulls (used to pull, before your commit) the `min_cluster_samples` from the clusterer, as documented here:
kepler-mapper/kmapper/kmapper.py
Lines 396 to 401 in 2ccce2d
If I loosely understand your commit message 2ccce2d, then it's possible that the clusterer did not have `n_clusters` set. The logic here:
```python
min_cluster_samples = cluster_params.get(
    "n_clusters",
    cluster_params.get(
        "min_cluster_size", cluster_params.get("min_samples", 1)
    ),
)
```
is intended to first try `n_clusters`; failing that, try `min_cluster_size`; failing that, try `min_samples`. What were you running where the clusterer had none of those three set, which was leading to a crash?
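For concreteness, here is a sketch of how that fallback chain resolves, assuming `cluster_params` is taken from the clusterer's scikit-learn `get_params()`:

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering

def resolve(cluster_params):
    return cluster_params.get(
        "n_clusters",
        cluster_params.get(
            "min_cluster_size", cluster_params.get("min_samples", 1)
        ),
    )

# DBSCAN's params contain no n_clusters or min_cluster_size key,
# so the chain falls through to min_samples.
print(resolve(DBSCAN(eps=0.5, min_samples=4).get_params()))  # 4

# With n_clusters=None the key exists but holds None, so dict.get
# returns the stored None rather than falling through to the next option.
params = AgglomerativeClustering(n_clusters=None, distance_threshold=15.0).get_params()
print(resolve(params))  # None
```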
Yes, I issued the pull request you are mentioning. I seem to recall that I got a scikit-learn crash if `affinity='precomputed'` and `n_clusters` was not `None`. Does that make sense?
I cannot justify the fixed number 2; it probably came from the default `n_clusters` in the scikit-learn docs.
I must admit I did not notice the logic you are referring to; I was naively focused on fixing the specific error mentioned in the pull request. Sorry!
Want to issue a new PR that still works for your use case, which was breaking, and which re-adopts that more flexible logic?
I certainly can if I am able to get the logic :)
- By using `affinity='precomputed'`, doesn't scikit-learn require `distance_threshold` to be set?
- How "safe" is it to manipulate `min_cluster_samples` a priori, compared to adjusting after Mapper has returned the complex?
> By using `affinity='precomputed'`, doesn't scikit-learn require `distance_threshold` to be set?
Which scikit-learn clusterer are you referring to? DBSCAN doesn't -- I have used DBSCAN thusly: `clusterer=km.cluster.DBSCAN(eps=.5, min_samples=4, metric='precomputed')`
How "safe" is it to manipulate min_cluster_samples a priori, compared to adjusting after Mapper has returned the complex?
That would be a tad hairy to manipulate afterwards because the complex could include links to clusters which might be dropped during adjustment, but in my naive opinion it's fine to do either before or after.
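For the post-hoc route, a sketch that also drops dangling links, assuming the `"nodes"`/`"links"` layout of the dict returned by `mapper.map` (`min_size` is a hypothetical cutoff):

```python
min_size = 5
kept = {
    node_id
    for node_id, members in graph["nodes"].items()
    if len(members) >= min_size
}
graph["nodes"] = {node_id: graph["nodes"][node_id] for node_id in kept}
# Drop links from or to any removed node, so the visualization never
# references a node that no longer exists.
graph["links"] = {
    source: [target for target in targets if target in kept]
    for source, targets in graph["links"].items()
    if source in kept
}
```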
Here is my full clustering section:
```python
import numpy as np
from scipy import spatial
from sklearn.cluster import AgglomerativeClustering as cl

# point_cloud is the ~2000-point array described above
point_cloud_column_variance = np.var(point_cloud, axis=0)
point_cloud_normalized = point_cloud / point_cloud_column_variance
condensed_distance_matrix = spatial.distance.pdist(point_cloud_normalized)
redundant_distance_matrix = spatial.distance.squareform(condensed_distance_matrix)

model_vne = cl(n_clusters=None,
               affinity='precomputed',
               compute_full_tree=True,
               linkage='single',
               distance_threshold=15.0)
model_vne.fit(redundant_distance_matrix)
```
EDIT: added clustering import statement
I re-read your earlier comment:
> To my understanding this defines the minimal number of samples in each cube of the covering. It certainly works as a reduction method, but it is difficult to know a priori what the number of samples should be, and pushing the minimum samples up results in different topologies.
It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't have min_samples, it is skipped -- it is not extended until it has min_samples within it. The logic is that if the hypercube doesn't have min_samples, then the clusterer run on that hypercube would definitely not find any cluster of points within that hypercube with at least min_samples.
Because it skips, rather than expands the window, there shouldn't be any difference in mapper output whether clusters are filtered out before or after the graph is created.
... except, hmm, in your case, where `min_samples` isn't used, I see that `AgglomerativeClustering` breaks when not fed at least two samples: https://github.com/scikit-learn/scikit-learn/blob/e5698bde9a8b719514bf39e6e5d58f90cfe5bc01/sklearn/cluster/_agglomerative.py#L796 I forgot to mention the "otherwise" condition in this logic:
```python
min_cluster_samples = cluster_params.get(
    "n_clusters",
    cluster_params.get(
        "min_cluster_size", cluster_params.get("min_samples", 1)
    ),
)
```
... failing the first three attempts, a `min_cluster_samples` of 1 is used.
I wonder (1) whether some persons would be interested in retaining one-element clusters (I bet the answer is "yes"), and if so, (2) how best to tell if a clusterer that requires at least two samples is being used.
Perhaps the default value of `n_clusters` for the constructor could be used as a fallback. For `AgglomerativeClustering`, that's `2`. This method can be used to get a function's default argument values.
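One way to do that, sketched with Python's standard `inspect` module:

```python
import inspect

from sklearn.cluster import AgglomerativeClustering

# Read the constructor's declared default for n_clusters.
sig = inspect.signature(AgglomerativeClustering.__init__)
print(sig.parameters["n_clusters"].default)  # 2
```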
Or maybe, if `distance_threshold` is set in `cluster_params`, then default to a minimum of `2`. I don't know how well that would generalize across scikit-learn clusterers, though.
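As a sketch, that heuristic would be one extra fallback after the chain above (`cluster_params` as before):

```python
# Hypothetical extra fallback: a configured distance_threshold implies a
# hierarchical clusterer, which needs at least two samples per fit.
if cluster_params.get("distance_threshold") is not None:
    min_cluster_samples = 2
```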
@sauln what do you think about the following: allow `min_cluster_samples` to be passed to `.map()`, which would
- be preferred over any `cluster_params` in determining whether or not to skip a hypercube
- be manually checked against node size before line 554 (see the sketch after the snippet below)
kepler-mapper/kmapper/kmapper.py
Lines 549 to 554 in 4c52847
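A rough sketch of the proposal (function and parameter names here are hypothetical, not actual kmapper internals):

```python
def resolve_min_cluster_samples(cluster_params, min_cluster_samples=None):
    # An explicit value passed to .map() wins over anything inferred
    # from the clusterer's parameters.
    if min_cluster_samples is not None:
        return min_cluster_samples
    return cluster_params.get(
        "n_clusters",
        cluster_params.get(
            "min_cluster_size", cluster_params.get("min_samples", 1)
        ),
    )

def keep_as_node(members, min_cluster_samples):
    # The same threshold is checked once more against node size
    # just before the node is added to the graph.
    return len(members) >= min_cluster_samples
```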
I don't see an easier way to handle clusterers such as `AgglomerativeClustering` which, in @torlarse 's case, have to have at least two samples per clustering attempt, but don't necessarily have any of `n_clusters`, `min_cluster_size`, or `min_samples` set. My above idea of checking default values for `n_clusters` doesn't generalize well to clusterers such as Birch, and otherwise there is no reliable kmapper way to specify a minimum number of samples per cluster.
> It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't have min_samples, it is skipped -- it is not extended until it has min_samples within it.
Yes, thanks, I hope I have this understanding of hypercubes; pardon my imprecise English. Setting `min_cluster_samples=20` on my data caused Mapper to return a "broken", non-circular topology. That makes sense, since by inspection I can see the nodes around the holes have between 10 and 15 elements.
Anyways, I made things work with a combination of setting `min_cluster_samples=10` in `kmapper.py` and `distance_threshold=15` in the clustering function call. I don't know if this will be a problem experienced by others.
@sauln I'm going to make the final fallback a minimum of 2, not a minimum of 1, and take out the hard-coded 2 for precomputed matrices
Seems reasonable to me.
I just found this again because I have @torlarse 's original problem -- I want to filter out nodes that are too small for my liking, and `min_samples` doesn't do that.
@sauln I'm going to add an argument to `.map()` that does what was originally asked -- specify a minimum number of points that a cluster must have in order to be retained as a node. `min_node_samples`?
I'll also add a filter to the vis, because hey why not