The scalable topological data analysis package for Apache Spark. This project aims to implement the following features:
- Scalable Mapper Implemented as Reeb Diagrams, i.e., Reeb Cosheaves
- Scalable Mapper Implementation
- Scalable Multiscale Mapper Implementation
- Scalable Tower Computation for Multiscale Mapper
- Scalable Persistent Homology Computation on Top of Apache Spark
If you would like to know how to use the features listed above, or to learn more about their implementation details, please follow the links.
WIP and EXPERIMENTAL. This package is still a proof of concept of scalable topological data analysis support for Apache Spark, so it should not be considered ready for production use.
This library requires Spark 2.0+
To compile this project, run the following from the project home directory:
$ sbt package
This will also run the Scala unit tests.
To run the unit tests, run the following from the project home directory:
$ sbt test
This project uses the sbt-spark-package plugin, which provides the 'spPublish' and 'spPublishLocal' tasks. We recommend using this library with Apache Spark by supplying a comma-delimited list of Maven coordinates with --packages, which includes the library's dependencies and downloads the package from the local repository or the official Spark Packages repository.
The package can be published locally with:
$ sbt spPublishLocal
The package can be published to Spark Packages with (requires authentication and authorization):
$ sbt spPublish
This package can be added to Spark using the --packages
command line option. For example, to include it when starting
the spark shell:
$ spark-shell --packages ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11
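The following is a minimal sketch, not part of the project itself, of how the same coordinate could be passed to other Spark entry points. It assumes that spark-submit accepts the same --packages option and that the spark.jars.packages configuration property is an equivalent way to supply the list. The join_packages helper, com.example.MyApp and my-app.jar are hypothetical names used only for illustration. The commands are printed rather than executed, so the sketch runs without a Spark installation.

```shell
# Coordinate copied from the spark-shell example above.
TDA_PACKAGE="ognis1205:spark-tda:0.0.1-SNAPSHOT-spark2.2-s_2.11"

# Hypothetical helper: join Maven coordinates with commas,
# which is the list format that --packages expects.
join_packages() {
  (IFS=,; printf '%s\n' "$*")
}

PACKAGES="$(join_packages "$TDA_PACKAGE")"

# Print the commands instead of running them (dry run).
# com.example.MyApp and my-app.jar are placeholders for your own application.
echo "spark-shell --packages $PACKAGES"
echo "spark-submit --packages $PACKAGES --class com.example.MyApp my-app.jar"
echo "spark-shell --conf spark.jars.packages=$PACKAGES"
```

To load additional packages alongside this one, pass their coordinates as further arguments to join_packages; the result is a single comma-delimited string.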
- Write Wiki
- Implement Python APIs
- Publish to Spark Packages
- Benchmark
- Consider using GraphFrames instead of plain GraphX
- Implement some useful filter functions (e.g., Gaussian Density, Graph Laplacian) as transformers
- G. Singh, F. Memoli, G. Carlsson (2007). Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Point Based Graphics 2007, Prague, September 2007.
- J. Curry (2013). Sheaves, Cosheaves and Applications, arXiv 2013.
- T. K. Dey, F. Memoli, Y. Wang (2015). Multiscale Mapper: A Framework for Topological Summarization of Data and Maps, arXiv 2015.
- E. Munch, B. Wang (2015). Convergence between Categorical Representations of Reeb Space and Mapper, arXiv 2015.
- E. Munch, B. Wang (2015). Reeb Space Approximation with Guarantees, The 25th Fall Workshop on Computational Geometry 2015.
- H. E. Kim (2015). Evaluating Ayasdi's Topological Data Analysis for Big Data, Master Thesis, Goethe University Frankfurt 2015.
- T. Liu, et al. (2004). An investigation of practical approximate nearest neighbor algorithms, Advances in Neural Information Processing Systems, 2004.
- T. Liu, C. Rosenberg, H. Rowley (2007). Clustering billions of images with large scale nearest neighbor search, IEEE Workshop on Applications of Computer Vision (WACV '07), 2007.
- D. Ravichandran, P. Pantel, E. Hovy (2005). Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering, ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics pp 622-629
- M. Steinbach, L. Ertoez, V. Kumar (2004). The Challenges of Clustering High Dimensional Data, New Directions in Statistical Physics, pp 273-309
- L. Ertoez, M. Steinbach, V. Kumar (2003). Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data, Proceedings of the Third SIAM International Conference on Data Mining, 2003.
- M. E. Houle, H. P. Kriegel, P. Kroeger, E. Schubert, A. Zimek (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?, Proceedings of the 22nd International Conference on Scientific and Statistical Database Management, 2010.