d2suite
d2suite is a C++ package for large-scale processing of discrete distribution (d2) data. It supports distributed analysis of distributions at scale, including nearest neighbors, clustering, and other machine learning tasks. d2suite makes heavy use of templates and C++11 features to maximize its extensibility to different types of data.
d2suite also contains a collection of computing tools that support the analysis of typical d2 data, such as images, sequences, and documents. Contributions are welcome.
[under construction]
Dependencies
- BLAS
- rabit: generic parallel infrastructure
- mosek (version 7.1): fast LP/QP solvers, academic license available.
Make sure these pre-compiled libraries are installed and configured in d2suite/make.inc, then build:
cd d2suite && make
To run the test cases, first decompress the demo datasets in the d2suite/data/test directory, then run:
make test
Introduction
Check out the main API and tests for a quick start.
Data Format Specifications
- def::Euclidean: discrete distribution over Euclidean space
- def::WordVec: discrete distribution with finite possible supports in Euclidean space (aka embeddings)
- def::NGram: n-gram data with cross-term distance
- def::Histogram: dense histogram with cross-term distance
- def::SparseHistogram: sparse histogram with cross-term distance
- def::Function<>: a bag of functions that operate on vectors
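To make the idea of a discrete distribution over Euclidean space concrete, here is a minimal sketch of one possible in-memory layout: a list of non-negative weights plus one support point per weight. The type `DiscreteDistribution` and the helpers `is_valid` and `normalize` are hypothetical names chosen for illustration; they are not the d2suite definitions behind `def::Euclidean`, whose actual layout should be checked in the main API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical illustration type; NOT the d2suite def::Euclidean definition.
struct DiscreteDistribution {
  std::size_t dim;               // dimension of the Euclidean space
  std::vector<double> weights;   // one non-negative weight per support point
  std::vector<double> supports;  // row-major, size = weights.size() * dim

  std::size_t size() const { return weights.size(); }
};

// Check the invariants: consistent sizes, non-negative weights summing to 1.
inline bool is_valid(const DiscreteDistribution& d, double tol = 1e-9) {
  if (d.dim == 0 || d.weights.empty()) return false;
  if (d.supports.size() != d.weights.size() * d.dim) return false;
  double total = 0.0;
  for (double w : d.weights) {
    if (w < 0.0) return false;
    total += w;
  }
  return std::fabs(total - 1.0) <= tol;
}

// Rescale the weights in place so that they sum to one.
inline void normalize(DiscreteDistribution& d) {
  double total = 0.0;
  for (double w : d.weights) total += w;
  assert(total > 0.0);  // a distribution needs some positive mass
  for (double& w : d.weights) w /= total;
}
```

For example, a two-point distribution in the plane with raw weights 1 and 3 fails `is_valid` until `normalize` rescales the weights to 0.25 and 0.75. A flat row-major support array is assumed here only because it is cache-friendly and easy to hand to BLAS routines.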
Basic Functions
- distributed/serial IO
- compute the distance between a pair of D2: the Wasserstein distance (also known as the earth mover's distance, EMD)
- compute lower/upper bounds of the Wasserstein distance
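As an illustration of what lower and upper bounds on the Wasserstein distance can look like, the sketch below computes two standard bounds for the 1-Wasserstein distance with a Euclidean ground metric. The lower bound is the distance between the two weighted means (it follows from Jensen's inequality), and the upper bound is the transport cost of the independent coupling, which is always a feasible plan. The type `D2` and the function names are hypothetical; which bounds d2suite itself implements is not stated here, so treat this as a sketch under those assumptions rather than the library's method.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical illustration type (not part of the d2suite API).
struct D2 {
  std::size_t dim;        // dimension of the Euclidean space
  std::vector<double> w;  // weights: non-negative, summing to 1
  std::vector<double> x;  // support points, row-major, w.size() * dim
};

// Euclidean distance between support point i of a and support point j of b.
inline double ground_distance(const D2& a, std::size_t i,
                              const D2& b, std::size_t j) {
  double s = 0.0;
  for (std::size_t k = 0; k < a.dim; ++k) {
    const double diff = a.x[i * a.dim + k] - b.x[j * b.dim + k];
    s += diff * diff;
  }
  return std::sqrt(s);
}

// Lower bound on W1: Euclidean distance between the two weighted means.
inline double w1_lower_bound(const D2& a, const D2& b) {
  assert(a.dim == b.dim);
  double s = 0.0;
  for (std::size_t k = 0; k < a.dim; ++k) {
    double ma = 0.0, mb = 0.0;
    for (std::size_t i = 0; i < a.w.size(); ++i) ma += a.w[i] * a.x[i * a.dim + k];
    for (std::size_t j = 0; j < b.w.size(); ++j) mb += b.w[j] * b.x[j * b.dim + k];
    s += (ma - mb) * (ma - mb);
  }
  return std::sqrt(s);
}

// Upper bound on W1: cost of the independent coupling pi(i, j) = w_a[i] * w_b[j].
inline double w1_upper_bound(const D2& a, const D2& b) {
  assert(a.dim == b.dim);
  double cost = 0.0;
  for (std::size_t i = 0; i < a.w.size(); ++i)
    for (std::size_t j = 0; j < b.w.size(); ++j)
      cost += a.w[i] * b.w[j] * ground_distance(a, i, b, j);
  return cost;
}
```

For a one-dimensional example with equal mass at 0 and 2 compared against a point mass at 1, the lower bound is 0 and the upper bound is 1, which brackets the exact distance of 1. Both bounds avoid solving a linear program, so they are cheap enough to use as filters before an exact solve.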
Learning
- K nearest neighbors [ongoing]
- D2-clustering [TBA]
- Wasserstein Mixed Membership Model
- Marriage Learning