unicount

A memory-efficient algorithm for estimating the number of unique elements in a sequence.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Distinct Elements in Streams

I was intrigued by how simple the algorithm in the Quanta article "Computer Scientists Invent an Efficient New Way to Count" seemed, so I tried to implement it myself. It turns out the article's description had a few errors (check the comments), but I was eventually able to get something working.

I turned it into a simple script: pass it a file and it prints an estimate of the number of distinct elements in that file.
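For readers who want the gist without opening the script, here is a minimal sketch of the kind of sampling estimator the Quanta article describes (generally known as the CVM algorithm). It is not a copy of unicount.py: the function name `estimate_distinct`, the `capacity` and `seed` parameters, and the choice to keep halving until the buffer is back under capacity are all assumptions made for illustration.

```python
import random


def estimate_distinct(stream, capacity=1000, seed=None):
    """Estimate the number of distinct items in `stream`.

    Hypothetical sketch: keeps at most `capacity` items in memory.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    rng = random.Random(seed)
    p = 1.0          # current probability of keeping a newly seen item
    buffer = set()   # sampled items

    for item in stream:
        # Forget any earlier decision about this item, then re-decide,
        # so only the item's most recent occurrence matters.
        buffer.discard(item)
        if rng.random() < p:
            buffer.add(item)

        # When the buffer is full, drop each item with probability 1/2
        # and halve the keep-probability for future items.
        while len(buffer) >= capacity:
            buffer = {x for x in buffer if rng.random() < 0.5}
            p /= 2

    # Each surviving item stands in for roughly 1/p distinct items.
    return len(buffer) / p


if __name__ == "__main__":
    words = "to be or not to be that is the question".split()
    print(estimate_distinct(words, capacity=100))
```

When the number of distinct items stays below `capacity`, the buffer never overflows, `p` stays at 1, and the result is an exact count. Larger inputs trade accuracy for bounded memory, and a larger `capacity` tightens the estimate.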

Usage

wget https://raw.githubusercontent.com/samedhi/unicount/main/unicount.py \
&& wget https://raw.githubusercontent.com/samedhi/unicount/main/hamlet.txt \
&& python unicount.py hamlet.txt

Reference