Entropy estimators

We provide implementations of our rate-optimal estimator and of some classical estimators in both Python and C++. This is a tutorial for both software users and developers who want to use entropy estimators!

Python

This Python code works on Python 3 and uses NumPy, the fundamental package for scientific computing with Python. Make sure Python 3 and NumPy are properly installed before using this code.

Basic script

Here is an example on how to use entropy estimators in Python:

>>> from entropy import *
>>> entropy = Entropy(k=100000)
>>> fin = [[1,2910],[2,45]]
>>> entropy.estimate(fin)
15.794990467501666

The output 15.794990 is the entropy estimate in bits.

  • from entropy import * imports all functions from entropy.py;
  • entropy = Entropy(k=100000) initializes an entropy estimator with alphabet size 100,000, an upper bound on the support size. A conservative upper bound is fine, since the estimator is insensitive to this choice; if no upper bound is available, an alternative is to use the sample size.
  • fin = [[1,2910],[2,45]] is the fingerprint, represented by a list of pairs. This means 2910 symbols appeared exactly once, and 45 symbols appeared exactly twice.
  • entropy.estimate(fin) produces the entropy estimate using our rate-optimal estimator. The unit of output is bits.

Functions

Samples statistics

The entropy estimator requires a fingerprint as input. If you start from raw data, we provide functions to compute this statistic (a short usage sketch follows the list):

  • sample_to_fin(sample) returns fingerprints of samples
    • input: a list of samples
    • output: a list of tuples (fingerprints)
  • hist_to_fin(hist) returns fingerprints from histogram (number of appearances/frequency counts)
    • input: a list of frequencies
    • output: a list of tuples (fingerprints)
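
For example, here is a minimal sketch of computing a fingerprint from raw samples or from a histogram; the exact tuple/list form returned by these functions may differ slightly from the comments:

>>> from entropy import *
>>> sample = [7, 3, 7, 1, 3, 7]    # raw samples: 7 appears three times, 3 twice, 1 once
>>> fin = sample_to_fin(sample)    # expected fingerprint ~ [[1,1],[2,1],[3,1]]
>>> hist = [3, 2, 1]               # the same data as frequency counts
>>> fin = hist_to_fin(hist)        # same fingerprint as above
>>> Entropy(k=10).estimate(fin)    # estimate in bits, with a conservative upper bound k=10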

Other estimators

We also implemented classical entropy estimators; all outputs are in bits (see the example after the list):

  • entropy.estimate_plug(fin): plug-in estimator
  • entropy.estimate_Miller_Madow(fin): Miller-Madow estimator
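
Continuing the basic script above, the classical estimators are called in the same way (a sketch; numeric outputs omitted):

>>> entropy = Entropy(k=100000)
>>> fin = [[1,2910],[2,45]]
>>> entropy.estimate_plug(fin)           # plug-in (empirical) estimate, in bits
>>> entropy.estimate_Miller_Madow(fin)   # Miller-Madow bias-corrected estimate, in bits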

Comprehensive script

We provide main.py as an example script for tuning the parameters of our entropy estimator. It can also be used directly as an estimator on data files!
Type python3 main.py -k 100000 -fin fin_sample.txt or python3 main.py -k 100000 -hist hist_sample.txt to experiment on the fingerprint "fin_sample.txt" or the histogram "hist_sample.txt", respectively; both are statistics of 3,000 samples generated from the uniform distribution over 100,000 symbols.

Program arguments

Main arguments:

  • -k int: set alphabet size.
  • -fin str: set the fingerprint input file. Each line of the file consists of two numbers: a frequency j and the number of symbols that appear exactly j times (an example file is shown below).
  • -hist str: set the histogram input file. Each line of the file consists of a single number: the frequency of one symbol. Symbol labels are not needed.
  • k must be provided. Either fin or hist must be provided; if both are provided, only fin is read.
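
For instance, the fingerprint [[1,2910],[2,45]] from the basic script corresponds to a two-line fingerprint file like the following (whitespace-separated; this is an illustration, not the actual content of fin_sample.txt):

1 2910
2 45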

Optional arguments:

  • -L int: set polynomial degree. Default L=1.6 log k.
  • -M float: set the right endpoint of approximation interval. Default M=3.5 log k.
  • -N int: set the threshold to apply the polynomial estimator. Default N=1.6 log k.
  • The parameters above can be combined, e.g., python3 main.py -k 100000 -fin fin_sample.txt -L 18 -M 40 -N 18.
  • --help or -h: see the list of arguments.

C++

Compile and run

Check out all source code, including the Makefile and .txt files.

Type make to compile the sources; you will get the executable file "entropy".

Type ./entropy -k=100000 -fin=fin_sample.txt or ./entropy -k=100000 -hist=hist_sample.txt to experiment on the fingerprint "fin_sample.txt" or the histogram "hist_sample.txt", respectively; both are statistics of 3,000 samples generated from the uniform distribution over 100,000 symbols.

Program arguments

Main arguments:

  • -k=number: set alphabet size.
  • -fin=filename: set the fingerprint input file. Each line of the file consists of two numbers: a frequency j and the number of symbols that appear exactly j times.
  • -hist=filename: set the histogram input file. Each line of the file consists of a single number: the frequency of one symbol. Symbol labels are not needed.
  • k must be provided. Either fin or hist must be provided; if both are provided, only fin is read.

Optional arguments:

  • -L=number: set polynomial degree. Default L=1.6 log k.
  • -M=number: set the right endpoint of approximation interval. Default M=3.5 log k.
  • -N=number: set the threshold to apply the polynomial estimator. Default N=1.6 log k.
  • The parameters above can be combined, e.g., ./entropy -k=100000 -fin=fin_sample.txt -L=18 -M=40 -N=18.
  • Type ./entropy -help or ./entropy -h to see the list of arguments.

More for developers

For developers who want to write a new test scratch for the entropy estimator, follow the example in main_test.cpp. After make you will also get the executable file "test".

Entropy estimator class: Entropy

Workflow (a minimal sketch follows the list):

  1. Set the alphabet size k and several parameters:
  • setDegree(int L): the degree of the approximating polynomial is L.
  • setInterval(int M): the approximation interval is [0, M/n], where n is the sample size.
  • setThreshold(int N): the polynomial estimator is applied to symbols whose frequency is at most N.
  2. Input a fingerprint or a histogram. Use one of the following:
  • setFin( filename ): each line of the file consists of two numbers: a frequency j and the number of symbols that appear exactly j times.
  • setFin( freq, count ): input two vectors of the same length: freq[i] is a frequency, and count[i] is the number of symbols that appear exactly freq[i] times.
  • setHist( filename ): each line of the file consists of one number: the frequency of a symbol. Symbol labels are not needed.
  • setHist( freq ): input one vector of frequencies.
  3. Output the various estimates:
  • estimate(): our rate-optimal polynomial estimator
  • estimate_plug(): plug-in estimator
  • estimate_Miller_Madow(): Miller-Madow estimator
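
Here is a minimal sketch of a test scratch following this workflow. The header name entropy.h, the constructor Entropy est(k), and the int vector types are assumptions for illustration (check main_test.cpp for the exact names and signatures); the setter and estimator methods are those listed above.

#include <cstdio>
#include <vector>
#include "entropy.h"   // assumed header name; see the repository sources

int main() {
    int k = 100000;                 // alphabet size (upper bound on support size)
    Entropy est(k);                 // assumed constructor; step 1: set alphabet size
    est.setDegree(18);              // polynomial degree L
    est.setInterval(40);            // approximation interval [0, M/n]
    est.setThreshold(18);           // apply the polynomial estimator up to frequency N

    // Step 2: input the fingerprint [[1,2910],[2,45]] from the tutorial.
    std::vector<int> freq  = {1, 2};
    std::vector<int> count = {2910, 45};
    est.setFin(freq, count);

    // Step 3: output the estimates (all in bits).
    printf("polynomial:   %f\n", est.estimate());
    printf("plug-in:      %f\n", est.estimate_plug());
    printf("Miller-Madow: %f\n", est.estimate_Miller_Madow());
    return 0;
}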

Synthetic data experiments

For those who want to generate synthetic samples within C++, you can use the standard random number generation facilities in <random>, which provide generators for many common types of distributions. If the distribution you need is not among them, you can use the general discrete distribution, which can be constructed in several ways; a sketch is given below.
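
As an illustration, here is a short, self-contained sketch that draws 3,000 samples from the uniform distribution over 100,000 symbols using only the standard <random> facilities, and then builds the frequency vector that setHist(freq) expects (the int element type is an assumption):

#include <map>
#include <random>
#include <vector>

int main() {
    const int k = 100000, n = 3000;
    std::mt19937 gen(42);                                   // fixed seed for reproducibility
    std::uniform_int_distribution<int> uniform(0, k - 1);   // uniform over k symbols
    // For an arbitrary law, std::discrete_distribution accepts a list of weights, e.g.:
    // std::discrete_distribution<int> dist({0.5, 0.3, 0.2});

    std::map<int, int> counts;                              // symbol -> number of appearances
    for (int i = 0; i < n; ++i) counts[uniform(gen)]++;

    std::vector<int> freq;                                  // histogram: one frequency per observed symbol
    for (const auto& kv : counts) freq.push_back(kv.second);
    // freq can now be passed to Entropy::setHist(freq).
    return 0;
}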

Reference

For a detailed explanation of the parameters, please refer to our paper Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation. The parameters described in the paper are L = c0 log k, M = c1 log k, and N = c2 log k; as in the paper, the default values are L = 1.6 log k, M = 3.5 log k, and N = 1.6 log k.