KeplerMapper
Nature uses as little as possible of anything. - Johannes Kepler
KeplerMapper is a Python class that implements a mapping algorithm. It can be used to visualize high-dimensional data and 3D point cloud data.
KeplerMapper employs approaches based on the MAPPER algorithm (Singh et al.) as first described in the paper "Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition".
KeplerMapper can make use of Scikit-Learn API compatible cluster and scaling algorithms.
Usage
Python code
# Import the class
import km
# Some sample data
from sklearn import datasets
data, labels = datasets.make_circles(n_samples=5000, noise=0.03, factor=0.3)
# Initialize
mapper = km.KeplerMapper(verbose=1)
# Fit to and transform the data
projected_data = mapper.fit_transform(data, projection=[0,1]) # X-Y axis
# Create dictionary called 'complex' with nodes, edges and meta-information
complex = mapper.map(projected_data, data, nr_cubes=10)
# Visualize it
mapper.visualize(complex, path_html="make_circles_keplermapper_output.html",
                 title="make_circles(n_samples=5000, noise=0.03, factor=0.3)")
Console output
..Projecting data using: [0, 1]
..Scaling with: MinMaxScaler(copy=True, feature_range=(0, 1))
Mapping on data shaped (5000L, 2L) using dimensions
Creating 1000 hypercubes.
created 86 edges and 57 nodes in 0:00:03.614000.
Wrote d3.js graph to 'make_circles_keplermapper_output.html'
Visualization output
Click here for an interactive version. Click here for an older interactive version.
Install
The class is currently a single file. Dropping it into any directory on Python's import path should work.
Required
KeplerMapper requires these libraries to be installed:
- NumPy
- Scikit-Learn
KeplerMapper works on both Python 2.7 and Python 3+.
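A quick way to confirm that the required libraries are importable before dropping the file in place is sketched below. The helper name `missing_dependencies` is hypothetical and not part of KeplerMapper. Note that Scikit-Learn is imported under the name `sklearn`.

```python
import importlib.util


def missing_dependencies(module_names):
    """Return the module names that cannot be found on the import path."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names of the two required libraries.
missing = missing_dependencies(["numpy", "sklearn"])
if missing:
    print("Missing required libraries: " + ", ".join(missing))
else:
    print("All required libraries are importable.")
```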
External resources
These resources are loaded by the visualization output.
- Roboto Webfont (Google)
- D3.js (Mike Bostock)
Parameters
Initialize
mapper = km.KeplerMapper(verbose=1)
Parameter | Description |
---|---|
verbose | Int. Verbosity of the mapper. Default = 0 |
Fitting and transforming
Input the data set. Specify a projection/lens type. Output the projected data/lens.
projected_data = mapper.fit_transform(data, projection="sum",
                                      scaler=km.preprocessing.MinMaxScaler())
Parameter | Description |
---|---|
data | NumPy array. The data to fit a projection/lens to. Required |
projection | One of: a list of dimension indices; a Scikit-Learn API compatible manifold learner or dimensionality reducer; or a string from ["sum","mean","median","max","min","std","dist_mean"]. Default = "sum" |
scaler | Scikit-Learn API compatible scaler. Scaler of the data applied before mapping. Use None for no scaling. Default = preprocessing.MinMaxScaler() |
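To make the string projections more concrete, here is a minimal pure-Python sketch. It assumes that each string lens collapses every row of the data to a single number and that the default scaler then rescales those numbers to the range 0 to 1. The names `LENSES`, `project` and `min_max_scale` are illustrative and not part of KeplerMapper, and `"dist_mean"` is left out because its exact definition is not given here.

```python
import statistics

# Assumption: every string lens reduces one row to one number.
LENSES = {
    "sum": sum,
    "mean": statistics.mean,
    "median": statistics.median,
    "max": max,
    "min": min,
    "std": statistics.pstdev,  # population standard deviation
}


def project(rows, lens):
    """Apply the named lens to every row, giving one value per sample."""
    reduce_row = LENSES[lens]
    return [float(reduce_row(row)) for row in rows]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]


rows = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0], [0.0, 0.0, 9.0]]
summed = project(rows, "sum")    # [6.0, 12.0, 9.0]
scaled = min_max_scale(summed)   # [0.0, 1.0, 0.5]
print(summed, scaled)
```

Passing a list such as `[0,1]` instead of a string selects those columns of the data directly, as in the usage example above.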
Mapping
topological_network = mapper.map(projected_X, inverse_X=None,
                                 clusterer=cluster.DBSCAN(eps=0.5,min_samples=3),
                                 nr_cubes=10, overlap_perc=0.1)
print(topological_network["nodes"])
print(topological_network["links"])
print(topological_network["meta"])
Parameter | Description |
---|---|
projected_X | NumPy array. Output from fit_transform. Required |
inverse_X | NumPy array or None. When None, cluster on the projection; otherwise cluster on the original data (inverse image). Default = None |
clusterer | Scikit-Learn API compatible clustering algorithm. The clustering algorithm to use for mapping. Default = cluster.DBSCAN(eps=0.5,min_samples=3) |
nr_cubes | Int. The number of cubes/intervals to create. Default = 10 |
overlap_perc | Float. How much the cubes/intervals overlap (relevant for creating the edges). Default = 0.1 |
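The interplay of `nr_cubes`, `overlap_perc` and the clusterer can be illustrated with a toy one-dimensional version of the Mapper idea. This is a sketch under stated assumptions, not KeplerMapper's actual implementation: the lens is assumed to be scaled to the range 0 to 1, each interval is widened by `overlap_perc` of its width, a simple gap-based rule stands in for DBSCAN, and two nodes are linked when they share at least one member. The function names and the node naming scheme are hypothetical.

```python
from itertools import combinations


def make_intervals(nr_cubes, overlap_perc, low=0.0, high=1.0):
    """Split [low, high] into nr_cubes intervals, each widened by a
    fraction overlap_perc of its width so that neighbours overlap."""
    width = (high - low) / nr_cubes
    extra = overlap_perc * width
    return [(low + i * width, low + (i + 1) * width + extra)
            for i in range(nr_cubes)]


def gap_cluster(points, eps):
    """Toy 1-D stand-in for a clusterer: walk the points in sorted order
    and start a new cluster whenever the gap to the previous one exceeds eps."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    labels = [0] * len(points)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if points[nxt] - points[prev] > eps:
            current += 1
        labels[nxt] = current
    return labels


def toy_mapper(lens, data, nr_cubes=10, overlap_perc=0.1, eps=0.5):
    """Cover the lens with overlapping intervals, cluster the data that
    falls in each interval, and link clusters that share members."""
    nodes = {}
    intervals = make_intervals(nr_cubes, overlap_perc)
    for cube_id, (start, end) in enumerate(intervals):
        members = [i for i, v in enumerate(lens) if start <= v <= end]
        if not members:
            continue
        labels = gap_cluster([data[i] for i in members], eps)
        for member, label in zip(members, labels):
            node_id = "cube%d_cluster%d" % (cube_id, label)
            nodes.setdefault(node_id, []).append(member)
    links = {}
    for a, b in combinations(sorted(nodes), 2):
        if set(nodes[a]) & set(nodes[b]):
            links.setdefault(a, []).append(b)
    return {"nodes": nodes, "links": links}


values = [0.0, 0.1, 0.5, 0.52, 0.9, 1.0]
network = toy_mapper(lens=values, data=values,
                     nr_cubes=2, overlap_perc=0.1, eps=0.3)
print(network["nodes"])
print(network["links"])
```

In this example the points 0.5 and 0.52 fall into both intervals, so the two clusters that contain them are connected by a link. Without overlap no members would be shared and no edges would be created.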
Visualizing
mapper.visualize(topological_network,
                 path_html="mapper_visualization_output.html")
Parameter | Description |
---|---|
topological_network | Dict. The topological_network dictionary with nodes, edges and meta-information. Required |
path_html | File path. Path where the .html file is written. Default = mapper_visualization_output.html |
title | String. Document title for use in the outputted .html. Default = "My Data" |
graph_link_distance | Int. Global length of links between nodes. Use less for larger graphs. Default = 30 |
graph_charge | Int. The charge between nodes. Use less negative charge for larger graphs. Default = -120 |
graph_gravity | Float. A weak geometric constraint similar to a virtual spring connecting each node to the center of the layout's size. Don't set it to a negative value, or it's turtles all the way up. Default = 0.1 |
custom_tooltips | NumPy Array. Create custom tooltips for all the node members. You could use the target labels y for this. Use None for standard tooltips. Default = None. |
show_title | Bool. Whether to show the title. Default = True |
show_meta | Bool. Whether to show meta information, like the overlap percentage and the clusterer used. Default = True |
show_tooltips | Bool. Whether to show the tooltips on hover. Default = True |
width_html | Int. Size in pixels of the graph canvas width. Default = 0 (full screen width) |
height_html | Int. Size in pixels of the graph canvas height. Default = 0 (full screen height) |
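When target labels are passed as `custom_tooltips`, it can help to see how those labels distribute over each node before opening the visualization. The sketch below assumes that the `"nodes"` entry of the network maps each node id to a list of member row indices. That layout, the helper name `summarize_nodes` and the sample values are assumptions for illustration.

```python
from collections import Counter


def summarize_nodes(nodes, labels):
    """Count how often each label occurs among the members of every node."""
    return {node_id: Counter(labels[i] for i in members)
            for node_id, members in nodes.items()}


# Assumed layout: node id -> list of member row indices.
nodes = {"a": [0, 1, 2], "b": [2, 3]}
labels = ["inner", "inner", "outer", "outer"]

summary = summarize_nodes(nodes, labels)
for node_id, counts in sorted(summary.items()):
    print(node_id, dict(counts))
```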
Examples
3D-point cloud
Check the examples directory for more.
Very noisy datasets
Check the examples\makecircles directory for code.
Dimensionality reduction
t-SNE on 4K images of MNIST dataset.
References
Mapper Algorithm
"Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition"
Gurjeet Singh, Facundo Mémoli, and Gunnar Carlsson
Topological Data Analysis
Stanford Seminar. "Topological Data Analysis: How Ayasdi used TDA to Solve Complex Problems"
SF Data Mining. "Shape and Meaning."
Anthony Bak
https://www.youtube.com/watch?v=x3Hl85OBuc0
https://www.youtube.com/watch?v=4RNpuZydlKY
Projection vs. Inverse image & Examples
MLconf ATL. Topological Learning with Ayasdi
Allison Gilmore
https://www.youtube.com/watch?v=cJ8W0ASsnp0
The shape of data
"Conference Talk. The shape of data"
Topology and Data
Gunnar Carlsson
https://www.youtube.com/watch?v=kctyag2Xi8o http://www.ams.org/images/carlsson-notes.pdf
Business Value, Problems, Algorithms, Computation and User Experience of TDA
Data Driven NYC. "Making Data Work"
Gurjeet Singh
https://www.youtube.com/watch?v=UZH5xJXJG2I
Implementation details and sample data
Python Mapper
Daniel Müllner and Aravindakshan Babu
http://danifold.net/mapper/index.html
Applied Topology
"Elementary Applied Topology"
R. Ghrist
https://www.math.upenn.edu/~ghrist/notes.html
Applied Topology
"Qualitative data analysis"
Community effort
Single Linkage Clustering
"Minimum Spanning Trees and Single Linkage Cluster Analysis"
J. C. Gower, and G. J. S. Ross
http://www.cs.ucsb.edu/~veronika/MAE/mstSingleLinkage_GowerRoss_1969.pdf
Clustering and Manifold Learning
Scikit-learn: Machine Learning in Python
Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.
http://scikit-learn.org/stable/modules/clustering.html
http://scikit-learn.org/stable/modules/manifold.html
Force-directed Graphing/Clustering
Force-directed Graphs
Mike Bostock, Tim Dwyer, Thomas Jakobsen
http://bl.ocks.org/mbostock/4062045
Graphing
Grapher
Cindy Zhang, Danny Cochran, Diana Suvorova, Curtis Mitchell
https://github.com/ayasdi/grapher
Color scales
"Creating A Custom Hot to Cold Temperature Color Gradient for use with RRDTool"
Dale Reagan
Design
Material Design
Design
Ayasdi Core Product Screenshots
Ayasdi
http://www.ayasdi.com/product/core/
Disclaimer
See disclaimer.txt for more. Basically this is a work in progress to familiarize myself with topological data analysis. The details of the algorithm implementations may be lacking. I'll gladly accept feedback and pull requests to make it more robust. You can contact me at info@mlwave.com or by opening an issue.