/quipu

Implements a bunch of Probabilistic data structures to count cardinality of a set

Primary LanguageClojureEclipse Public License 1.0EPL-1.0

Quipu

A Clojure library which implements a bunch of Probabilistic data structures to count cardinality of a set. Most common use of this kind of structure is to find out "unique visitors".

Installation

For Leiningen project use

[net.kirankulkarni/quipu "0.1.0"]

Counting Methods

Currently this library supports two counting methods Linear Counting and LogLog Counting.

Linear Counting

As the name suggests space required for this algorithm is linearly proportional to actual cardinality. It gives cardinality with 1% error at most.

Please refer the original paper: Kyu-Young Whang , Brad T. Vander-Zanden , Howard M. Taylor, A linear-time probabilistic counting algorithm for database applications, ACM Transactions on Database Systems (TODS), v.15 n.2, p.208-229, June 1990

Usage

You can create a linear counter by passing expected cardinality of the set as an argument. If there is no way of knowing expected cardinality then use number of elements (including duplicates) to create a counter.

In following example we will create a linear counter for 1,000 unique elements. We create a loop which runs 10,000 times each time inserting a number from 0-999. At the end we will get cardinality of counter

user> (require '[net.kirankulkarni.quipu.core :refer :all])
nil
user> (require '[net.kirankulkarni.quipu.linear :refer [get-linear-counter]])
nil
user> (let [counter (get-linear-counter 1000)
            elements (mapv identity (range 0 1000))]
        (println "byts used" (size counter))
        (doseq [n (range 10000)]
          (add counter (get elements (mod n 1000))))
        (get-card counter))

bytes used 666
999

It returned 999 which is estimated cardinality. It used 666 Bytes i.e. 0.6KB for 1,000 elements

LogLog Counting

Space required for this algorithm is logarithmically proportional to actual cardinality. This algorithm works with buckets of bits. Unlike Linear counter here you can tune expected error (which range from (0.01% to 90%). Generally with loglog counters 4% expected error is used

Please refer the original paper: Loglog counting of large cardinalities - Durand, Flajolet - 2003

Usage

You can create a loglog counter by passing exepcted cardinality and expected error rate. We will use similar example we used in Linear Counting

In following example we will create a linear counter for 10,000 unique elements. We create a loop which runs 100,000 times each time inserting a number from 0-9999. At the end we will get cardinality of counter

user> (require '[net.kirankulkarni.quipu.core :refer :all])
nil
user> (require '[net.kirankulkarni.quipu.loglog :refer [get-loglog-counter]])
nil
user> (let [counter (get-loglog-counter 10000 0.04)
            elements (mapv identity (range 0 10000))]
        (println "bytes used" (size counter))
        (doseq [n (range 100000)]
          (add counter (get elements (mod n 10000))))
        (int (get-card counter)))
bytes used 640
10009

As we see it returned 10009 as estimated cardinality. It used 640 bytes i.e. 0.6KB for 10,000 elements

Performance

I calculated number of unique words from The Complete Work of Shakespeare It has total 1,410,671 words (including duplicates). I created linear and logical counter using this number, and for loglog counting I am using 4% error. Here are the results of the same

Actual count of unique words is : 59,724

MethodEstimated CountBytes Used
Clojure Set59,7245,421,192 ~ 5MB
Linear Counting59,86825,439 ~ 24KB
Loglog Counting59,774640 ~ 0.6 KB

License

Copyright © 2013 Kiran Kulkarni

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.