/d2suite

A C++ package for discrete distribution based large-scale data processing framework

Primary LanguageC++OtherNOASSERTION

d2suite

d2suite is a C++ package for discrete distribution (d2) based large-scale data processing framework. It supports distributed data analysis of distributions at scale, such as nearest neighbors, clustering, and some other machine learning capability. d2suite uses templates and C++11 features a lot, aiming to maximize its extensibility for different types of data.

d2suite also contains a collection of computing tools supporting the analysis of typical d2 data, such as images, sequences, documents. Contributions are welcomed.

[under construction]

Dependencies

  • BLAS
  • rabit: the use of generic parallel infrastructure
  • mosek (version 7.1): fast LP/QP solvers, academic license available.

Make sure you have those pre-compiled libraries installed and configured in the d2suite/make.inc.

cd d2suite && make

You can run the test cases by first decompressing demo datasets in d2suite/data/test directory, then try

make test

Introduction

Checkout the main API and tests for a quick start.

Data Format Specifications

  • def::Euclidean: discrete distribution over Euclidean space
  • def::WordVec: discrete distribution with finite possible supports in Euclidean space (aka, embeddings)
  • def::NGram: n-gram data with cross-term distance
  • def::Histogram: dense histogram with cross-term distance
  • def::SparseHistogram: sparse histogram with cross-term distance
  • def::Function<>: a bag of functions that operate on vectors

Basic Functions

  • distributed/serial IO
  • compute distance between a pair of D2: Wasserstein distance (or EMD).
  • compute lower/upper bounds of Wasserstein distance

Learnings

  • K nearest neighbors [ongoing]
  • D2-clustering [TBA]
  • Wasserstein Mixed Membership Model
  • Marriage Learning