StreamingCC

A C++ library for summarizing data (data streams, in particular).

Algorithms

StreamingCC implements various streaming algorithms and probabilistic data structures. These can summarize a data stream effectively even when the data is too large to fit into memory.
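To make "summarizing a stream in bounded memory" concrete, here is a minimal sketch of one classic streaming summary, the Misra-Gries frequent-items algorithm. It is for illustration only: the class name `MisraGriesSketch` and its methods are hypothetical and are not claimed to be part of the StreamingCC API. The sketch keeps at most k - 1 counters regardless of stream length, and each estimate undercounts the true frequency by at most n / k, where n is the number of items seen.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical illustration, NOT part of StreamingCC.
// Misra-Gries summary: approximate item frequencies using at most k - 1 counters.
class MisraGriesSketch {
public:
    explicit MisraGriesSketch(std::size_t k) : k_(k) { assert(k >= 2); }

    // Process one item from the stream.
    void ProcessItem(int item) {
        auto it = counters_.find(item);
        if (it != counters_.end()) {
            ++it->second;                      // item already tracked
        } else if (counters_.size() < k_ - 1) {
            counters_[item] = 1;               // a free counter is available
        } else {
            // No free counter: decrement every counter and drop those that hit zero.
            for (auto c = counters_.begin(); c != counters_.end();) {
                if (--c->second == 0) {
                    c = counters_.erase(c);
                } else {
                    ++c;
                }
            }
        }
    }

    // Lower bound on the true frequency of item (0 if the item is not tracked).
    std::size_t Estimate(int item) const {
        auto it = counters_.find(item);
        return it == counters_.end() ? 0 : it->second;
    }

    // Number of counters currently in use (never more than k - 1).
    std::size_t NumCounters() const { return counters_.size(); }

private:
    std::size_t k_;
    std::unordered_map<int, std::size_t> counters_;
};
```

Memory use depends only on k, not on the stream length, which is the property that lets this kind of summary handle data that does not fit into memory.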

Algorithms/Data Structures (will be) included in StreamingCC:

Dependencies

  • CMake (>= 2.8.7)
  • A C++ compiler with C++11 support
  • Armadillo (optional, required by some features)

How to Compile

The source code compiles to a static library.

Step 1, clone the repository to your local machine

$ git clone https://github.com/jiecchen/StreamingCC.git

Step 2, compile the library

$ cd StreamingCC/
$ cmake .
$ make

Step 3, install the library on your system

$ sudo make install

Example

Suppose you have a file sampling.cc containing the following code:

#include <streamingcc>
#include <iostream>
using namespace streamingcc;

int main() {
    // create an object which will maintain
    // 10 samples (with replacement) dynamically
    ReservoirSampler<int> rsmp(10);
    // sample from a data stream with length 1,000,000
    for (int i = 0; i < 1000000; i++)
        rsmp.ProcessItem(i);

    // print the samples
    for (auto sample: rsmp.GetSamples())
        std::cout << sample << " ";
    std::cout << std::endl;
    return 0;
}

Now compile the code:

$ g++ -std=c++11 -O3 -o sampling sampling.cc -lstreamingcc

This generates an executable file named sampling. Run the binary with:

$ ./sampling
916749 93283 843814 534073 877348 445467 369729 163394 67058 212209 
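For readers curious about what such a sampler does, the following is a minimal, self-contained sketch of reservoir sampling with replacement. It is an assumption about the general technique, not a copy of StreamingCC's implementation: the class name `SimpleReservoir` and its optional seed parameter are hypothetical, and only the method names `ProcessItem` and `GetSamples` mirror the example above. Each of the k slots acts as an independent one-item reservoir: when the t-th item arrives, every slot is overwritten by it with probability 1/t, so each slot ends up holding a uniformly random item of the stream.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical illustration, NOT StreamingCC's implementation.
// Maintains k uniform samples (with replacement) from a stream of unknown length.
template <typename T>
class SimpleReservoir {
public:
    explicit SimpleReservoir(std::size_t k, unsigned seed = std::random_device{}())
        : k_(k), seen_(0), rng_(seed) {
        assert(k > 0);
        samples_.reserve(k);
    }

    // Process one item from the stream.
    void ProcessItem(const T& item) {
        ++seen_;
        if (seen_ == 1) {
            // The first item fills every slot.
            samples_.assign(k_, item);
            return;
        }
        // Each slot independently keeps its old value with probability 1 - 1/t.
        std::bernoulli_distribution replace(1.0 / static_cast<double>(seen_));
        for (auto& slot : samples_) {
            if (replace(rng_)) slot = item;
        }
    }

    // Current samples; empty until at least one item has been processed.
    const std::vector<T>& GetSamples() const { return samples_; }

private:
    std::size_t k_;
    unsigned long long seen_;
    std::mt19937 rng_;
    std::vector<T> samples_;
};
```

This version does O(k) work per item, which keeps the code short; a production implementation could skip ahead between replacements instead. Because the samples are random, the printed values differ from run to run unless a fixed seed is supplied.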

Recently Included

  • A library for sampling from noisy data; see its readme for more information.
  • A library for constructing a coreset of a dataset, so that one can do clustering with outliers; see its readme for more information.
  • A library for embeddings that preserve edit distance; see its readme for details.

TODO

  • Add more docs
  • Add more python wrappers
  • Add more examples
  • Unify the interfaces of the recently added libraries

Credit

Maintainer

Other contributor(s)

License

MIT License