Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Bit-Information-Content Tool

Bitshaper compares data while ignoring the bits that carry no significant information. It can also be used to explore the information content of the data.

Quick start

This section shows some examples of the most common applications.

Install

python3 -m pip install bitinformation

Compare GRIB files

The comparison is made with a mask. The mask can be calculated by the analyser, read from the file, or passed with the mask parameter.

In the usual case you just want to compare two files. This assumes that you already have a configuration.

bitshaper.py --compare file1.grib file2.grib --preprocessor raw

If you don't have a configuration file yet, you can create one by running the tool with the --use-analyser and --add-missing-parameters arguments.

bitshaper.py --compare file1.grib file2.grib --preprocessor raw --use-analyser --add-missing-parameters

Compute bitsPerValue in Simple Packing

bitshaper.py --compare file1.grib file2.grib

Explore data

If you want to analyse the levels of each parameter, you must first define the primary key. This is a set of keys from different sources, e.g. MARS keys, analyser parameters and preprocessor parameters. Then, with --value-key, you define which values you want to record. The data is then exported to a CSV file.

# $tool is the bitshaper.py command and $out_dir an existing output directory
params+="--primary-key short_name stream analyser_precision levelist preprocessor_bits_per_value "
params+="--value-key mask nbits_used "
params+="--csv $out_dir/explore.csv "
$tool $params --stats file1.grib file2.grib file3.grib

Algorithm behind the scene

The method calculates how much information each bit of a number carries. In essence, it is a statistical analysis of bit sequences. For example, according to this approach, random sequences of binary values and sequences consisting only of ones or only of zeros contain no information. Once a sequence has a structure, the information content is non-zero.

[0101010101010101] # low information content
[1111111111111111] # zero information content
[0000000000000000] # zero information content
[0000111100001111] # high information content

The following example explains the algorithm step by step, avoiding formulas where possible.

In the first step, assume there is a sequence S of 4-bit numbers. The sequence S is split into two arrays, A and B: A is created by removing the last element from S, and B by removing the first element. The example below uses Python notation to illustrate this.

S = [0, 1, 2, 3, 4, 5, 6, 7]

A = S[:-1] = [0, 1, 2, 3, 4, 5, 6]
B = S[1: ] = [1, 2, 3, 4, 5, 6, 7]

The next step is presented as a spreadsheet. The example works with 4-bit numbers, so each bit can be identified by an index i in [0, 1, 2, 3], where i = 0 is the least significant bit. For illustration, the table is expanded so that every pair (A, B) is repeated for each i. A' and B' are the binary representations of the columns A and B, respectively. The columns A'[i] and B'[i] hold the bits at position i.

i A B A'=bin(A) B'=bin(B) A'[i] B'[i] seq = A'[i]B'[i]
0 0 1 0000 0001 0 1 01
0 1 2 0001 0010 1 0 10
0 2 3 0010 0011 0 1 01
0 3 4 0011 0100 1 0 10
0 4 5 0100 0101 0 1 01
0 5 6 0101 0110 1 0 10
0 6 7 0110 0111 0 1 01
1 0 1 0000 0001 0 0 00
1 1 2 0001 0010 0 1 01
1 2 3 0010 0011 1 1 11
1 3 4 0011 0100 1 0 10
1 4 5 0100 0101 0 0 00
1 5 6 0101 0110 0 1 01
1 6 7 0110 0111 1 1 11
2 0 1 0000 0001 0 0 00
2 1 2 0001 0010 0 0 00
2 2 3 0010 0011 0 0 00
2 3 4 0011 0100 0 1 01
2 4 5 0100 0101 1 1 11
2 5 6 0101 0110 1 1 11
2 6 7 0110 0111 1 1 11
3 0 1 0000 0001 0 0 00
3 1 2 0001 0010 0 0 00
3 2 3 0010 0011 0 0 00
3 3 4 0011 0100 0 0 00
3 4 5 0100 0101 0 0 00
3 5 6 0101 0110 0 0 00
3 6 7 0110 0111 0 0 00
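The table above can be reproduced with a few lines of Python. This is only a minimal sketch of the two steps described so far, not the tool's actual code; the names NBITS, bit and rows are made up for illustration.

```python
# Sketch only: rebuilds the example table, not bitshaper's implementation.
S = [0, 1, 2, 3, 4, 5, 6, 7]
NBITS = 4  # the example uses 4-bit numbers

# Step 1: neighbouring values are paired by shifting the sequence by one.
A, B = S[:-1], S[1:]


def bit(value, i):
    """Return the bit of value at position i (i = 0 is the least significant bit)."""
    return (value >> i) & 1


# Step 2: one row per bit position i and per pair (a, b).
rows = [
    (i, a, b, f"{bit(a, i)}{bit(b, i)}")
    for i in range(NBITS)
    for a, b in zip(A, B)
]

for i, a, b, seq in rows:
    print(i, a, b, format(a, "04b"), format(b, "04b"), seq)
```

Each printed line corresponds to one row of the table, with the last column holding the two-bit sequence seq.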

The next step is to group the table by the (i, seq) columns and count the occurrences. p is the probability with which a sequence occurs at bit position i.

i seq count p = count/7
0 00 0 0.000
0 01 4 0.571
0 10 3 0.429
0 11 0 0.000
1 00 2 0.286
1 01 2 0.286
1 10 1 0.143
1 11 2 0.286
2 00 3 0.429
2 01 1 0.143
2 10 0 0.000
2 11 3 0.429
3 00 7 1.000
3 01 0 0.000
3 10 0 0.000
3 11 0 0.000
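The grouping and counting can be sketched with collections.Counter. The function name pair_probabilities is an assumption made for this illustration and is not part of the tool.

```python
from collections import Counter

# Sketch only: counts the two-bit sequences of the example per bit position.
S = list(range(8))
NBITS = 4
pairs = list(zip(S[:-1], S[1:]))  # the (A, B) pairs of the example


def pair_probabilities(pairs, i):
    """Return the probability of each sequence 00, 01, 10, 11 at bit position i."""
    counts = Counter(f"{(a >> i) & 1}{(b >> i) & 1}" for a, b in pairs)
    n = len(pairs)
    # A Counter returns 0 for sequences that never occur.
    return {seq: counts[seq] / n for seq in ("00", "01", "10", "11")}


for i in range(NBITS):
    probs = pair_probabilities(pairs, i)
    print(i, {seq: round(p, 3) for seq, p in probs.items()})
```

The denominator is the number of pairs, which is 7 in this example.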

In the last step we compute the mutual information. To do that we take the columns i and p from the table and reshape them so that we have the probabilities for each sequence, i.e., p00, p01, p10, p11, in separate columns. This allows us to continue our example as a spreadsheet.

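The formula below computes the mutual information M, in bits, for each bit position i. It says how much information a bit contains. Terms whose probability is zero contribute nothing to the sum.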

M' = p00 * log(p00 / (p00 + p01) / (p00 + p10)) +
     p01 * log(p01 / (p00 + p01) / (p01 + p11)) +
     p10 * log(p10 / (p10 + p11) / (p00 + p10)) +
     p11 * log(p11 / (p10 + p11) / (p01 + p11))

M = M' / log(2)

i p00 p01 p10 p11 M
0 0.000 0.571 0.429 0.000 0.985
1 0.286 0.286 0.143 0.286 0.020
2 0.429 0.143 0.000 0.429 0.522
3 1.000 0.000 0.000 0.000 0.000
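The formula for M can be written as a small Python function. This is a sketch under one assumption: terms with zero probability are skipped, following the usual convention that 0 * log(0) counts as 0. The function name mutual_information and the counts dictionary are illustrative only; the counts are taken from the count column of the example.

```python
import math


def mutual_information(p00, p01, p10, p11):
    """Mutual information in bits between the two bits of a sequence."""
    joint = {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}
    pa = (p00 + p01, p10 + p11)  # marginal probabilities of the first bit
    pb = (p00 + p10, p01 + p11)  # marginal probabilities of the second bit
    m = 0.0
    for (a, b), p in joint.items():
        if p > 0:  # assumption: zero-probability terms contribute nothing
            m += p * math.log(p / (pa[a] * pb[b]))
    return m / math.log(2)  # convert from nats to bits


# Counts of the sequences 00, 01, 10, 11 per bit position i, from the example.
counts = {0: (0, 4, 3, 0), 1: (2, 2, 1, 2), 2: (3, 1, 0, 3), 3: (7, 0, 0, 0)}
for i, c in counts.items():
    probs = [x / 7 for x in c]
    print(i, round(mutual_information(*probs), 3))
```

Because each of the two bits has only two states, the result can never exceed 1 bit, which makes a quick sanity check for any computed value.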