This repository contains a code developed for the paper on Statistical Inference of Minimally Complex Models available in arXiv:2008.00520. The code performs an exhaustive search for the best Minimally Complex Spin Model (MC-Spin Model) on a given basis.
The code go through all possible MC-spin models of a given rank r
, where an MC-spin model is defined by a given partition of the r
basis operators provided (see paper). The comparison between models is based on their evidence (posterior probability that the model produces the data, integrated over the parameter values). The selected model is the one with the largest evidence.
One big advantage of this family of models (the MC-spin models) is that the computation of the evidence doesn’t require fitting the parameters of the model, which allows to significantly accelerate the comparison between models. The limiting factor of this code is the exhaustive search for the best model as the space of models is exponentially increasing with r
.
For a given number of r
of basis elements, the number of possible partitions the r
basis operators is equal to the Bell number of r
, which grows exponentially with r
. The running time of the program also grows linearly with the number of different states observed in the system (so approximatively linearly with the number of datapoints in the dataset). For these reasons, this code is good for use on small systems, typically with r <~15
variables. Note that for r~15
, the code will take several days to perform the exhaustive search.
The code uses the C++11 version of C++.
To compile: g++ -std=c++11 -O3 main.cpp Data_Operations.cpp LogE.cpp LogL.cpp Complexity.cpp Best_MCM.cpp Basis_Choice.cpp
To execute: ./a.out
All useful functions that can be called from int main()
are declared at the beginning of the main.cpp
file and described in the sections below. For hands-on and simple tests of the program, check the examples in the function int main()
of the main.cpp
file. (Add details on the dataset taken as an example).
This code is an open source project under the GNU GPLv3.
Before compiling specify the following global variables:
const unsigned int n
, with the number of variables of the dataset; This number can also be the number of basis operators in the provided basis; it must not be smaller than the number of basis operators.const string OUTPUT_directory
, with the name of the output directory; All the generated files will be placed in this folder.const string datafilename
, with the location and name of the binary datafile.- (Optional)
const string basis_IntegerRepresentation_filename
, with the location and name of the input file containing the basis element written in the integer representation. - (Optional)
const string basis_BinaryRepresentation_filename
, with the location and name of the input file containing the basis element written in the binary representation.
The element of the basis for building the Minimally Complex Model (MCM) has to be specified by the user before compiling the code.
The basis can be written “by hand” directly at the beginning of the int main()
function in uint32_t Basis_Choice[]
, or by using an input file (see subsections below).
If you don’t know which basis to use, you can run the minimally complex model algorithm on the original basis in which the dataset is written. This can be done by using the function list<uint32_t> Original_Basis()
to define the basis.
In general, we advise you to use the basis in which the dataset is the closest to be generated from an independent model (see discussion in the paper). Finding this basis can be done using the heuristic/exhaustive search algorithm available separately here.
In the code, a basis is stored as a list of integers list<uint32_t> Basis
, where each integer defines a basis operator (see explanation below).
Basis elements are spin operators that are all independent from each others (see paper). You can use the function (to come) to check if the elements you have specified in list<uint32_t> Basis
actually form a basis, i.e. if the set is only composed of independent operators.
Each element of the basis must be a spin operator. A spin operator is the product of a subset of spin variables (see paper). For instance, Op = s1 * s2
is a spin operator (physically, it is associated to a pairwise interactions between s1
and s2
); Op = s1*s2*s3
is also a spin operator (this time associated to a three-body interaction between s1
, s2
and s3
).
In the code, spin operators are encoded on a binary number of n
bits, where n
is the number of spin variables in the system (which you must define in the file data.h
). The binary representation of a given operator has a bit 1
for each spin included in the operator, and 0
for all the other spins. Importantly, spin variables are numbered from the right to the left.
For instance, take the operator Op = s1 s2
, this operator would be represented in the code by the binary number Op = 000000011
(for n=9
). Finally, to simplify the definition of a spin operator in the code, you can directly use the integer corresponding to this binary number. For instance, to defined the operator Op = s1 s2
, you can use the binary representation Op = 000000011
or the integer representation Op = 3
.
Example: the three representations of a spin operator: --> Op = s1 s2 Spin operator --> Op = 000000011 Binary representation --> Op = 3 Integer representation ( 000000011 = 3 )
Finally, the number of elements in the basis must be at most equal to the number n
of variables in the system. Note that the rank r
of the basis can also be smaller than n
. In this case, the code will automatically truncate the data to reduce it to the sub-space defined by the r
specified basis elements.
Note, in the example above, that the convention taken for writing the spin operators is to label the spin variables from the right to the left in the binary representation. This is just a convention and doesn't change the ordering of the spin variables from their order in the dataset, i.e., the first variable on the left in the binary representation of the operators is the same as the first variable on the left in the dataset provided as input file.
However, an important point to be able to interpret the results of the program, is that we adopted the same convention for the new basis: the first operator provided in the basis will correspond to the variable sigma1
, which will be the most on the right in the transformed dataset (see example in the next section).
The basis can be specified by hand directly at the beginning of the int main()
function in uint32_t Basis_Choice[]
. In this case, you we advise you to use the integer representation of the basis operators. In the example provided in the int main()
function: uint32_t Basis_Choice[] = {36, 10, 3, 272, 260, 320, 130, 65, 4}
defines a basis with 9
independent operators. Here are the different representations for these spin operators: first, the integer representation; second, the binary representation; third, the corresponding spin operators; and finally representations of these operators in the new basis:
36 000100100 s3 s6 -->> new basis : 000000001 1 sigma1 10 000001010 s2 s4 -->> new basis : 000000010 2 sigma2 3 000000011 s1 s2 -->> new basis : 000000100 4 sigma3 272 100010000 s5 s9 -->> new basis : 000001000 8 sigma4 260 100000100 s3 s9 -->> new basis : 000010000 16 sigma5 320 101000000 s7 s9 -->> new basis : 000100000 32 sigma6 130 010000010 s2 s8 -->> new basis : 001000000 64 sigma7 65 001000001 s1 s7 -->> new basis : 010000000 128 sigma8 4 000000100 s3 -->> new basis : 100000000 256 sigma9
The following functions allow you to define a basis from an input file.
list<uint32_t> Read_BasisOp_BinaryRepresentation()
, if operators are written using the binary representation (see example file in theINPUT
folder); the location of the file must be specified indata.h
in the variablebasis_BinaryRepresentation_filename
.list<uint32_t> Read_BasisOp_IntegerRepresentation()
, if operators are written using the integer representation (see example file in theINPUT
folder); the location of the file must be specified indata.h
in the variablebasis_IntegerRepresentation_filename
.
For any of these two functions, operators should be written in one single column in the file.
To check the information about a basis, you can print it in the terminal using the function void PrintTerm_Basis(list<uint32_t> Basis_li)
.
The function map<uint32_t, unsigned int> read_datafile(unsigned int *N)
reads the dataset available at the location specified in the variable const string datafilename
in data.h
. The dataset is then stored in the a structure map<uint32_t, unsigned int> Nset
that map each observed states to the number of times they occur in the dataset.
The function map<uint32_t, unsigned int> build_Kset(map<uint32_t, unsigned int> Nset, list<uint32_t> Basis, bool print_bool=false)
changes the basis of the dataset from its original basis (or the one in which Nset
provided as an argument is written) to the one provided as the argument Basis
.
Two functions are available to perform a search among MC-Spin Models:
map<uint32_t, uint32_t> MCM_GivenRank_r(map<uint32_t, unsigned int > Kset, unsigned int r, unsigned int N, double *LogE_best)
compares all the MCM of rank r, based on ther
first elements of the new basisBasis_li
(the one used to build Kset);map<uint32_t, uint32_t> MCM_allRank(map<uint32_t, unsigned int > Kset, unsigned int N, double *LogE_best)
compares all the MCM based on ther
first elements of the new basisBasis_li
for allr=1
to the size ofBasis_li
.
These two functions enumerate all possible partitions of a set using variantes of the algorithm E described in [D.E. Knuth, The Art of Computer Programming, Volume 4, Combinatorial Algorithms: Part 1 (Addison-Wesley Professional, 2011)]. The algorithm efficiently generates all set partitions in Gray-code order.
In this program, all the compared MC-spin models correspond to a partition of the basis operators. Once the dataset converted in this basis, an MC-spin model will be encoded on m
digits, where .
In the code, partitions of the r
digits are encoded in two different ways.
(Explains how to read a MCM)
The function void PrintTerminal_MCM_Info(map<uint32_t, unsigned int > Kset, unsigned int N, map<uint32_t, uint32_t> MCM_Partition)
prints in the terminal information about the MC-spin model given as the argument MCM_Partition
.
(Details what is printed)
Users can also get direct information about any subcomplete part (SC-part) of an MCM with the functions:
double LogL_SubCM(map<uint32_t, unsigned int > Kset, uint32_t Ai, unsigned int N, bool print_bool = false)
returns the log-likelihood of the SC-part whereKset
is the dataset written in the new basis, and whereAi
is the binary representation of the SC-part (see sectionEncoding of MC-Spin Models
).double LogE_SubCM(map<uint32_t, unsigned int > Kset, uint32_t Ai, unsigned int N, bool print_bool = false)
returns the log-evidence of the SC-part whereKset
is the dataset written in the new basis, and whereAi
is the binary representation of the SC-part (see sectionEncoding of MC-Spin Models
).double ParamComplexity_SubCM(unsigned int m, unsigned int N)
returns the model complexity of the SC-part due to the number of parameters in the part; this is the first complexity term appearing the Minimum Description Length principle (which is of the order ofO(log N)
whereN
is the number of datapoints -- see paper).double GeomComplexity_SubCM(unsigned int m)
returns the geometric complexity of the SC-part; this is the secdon complexity term appearing the Minimum Description Length principle (which is of the order ofO(1)
-- see paper).
Users can also get direct information about the MCM with the functions:
double LogL_MCM(map<uint32_t, unsigned int > Kset, map<uint32_t, uint32_t> Partition, unsigned int N, bool print_bool = false)
returns the log-likelihood of the MCM defined byPartition
;double LogE_MCM(map<uint32_t, unsigned int > Kset, map<uint32_t, uint32_t> Partition, unsigned int N, bool print_bool = false)
returns the log-evidence of the MCM defined byPartition
;double Complexity_MCM(map<uint32_t, uint32_t> Partition, unsigned int N, double *C_param, double *C_geom)
place the parameter complexity and the geometric complexity of the MCM model defined inPartition
respectively at the addresses*C_param
and*C_geom
; return the total complexity of the model.
Input files must be stored in the INPUT
folder, you must provide the following input files:
- a binary datafile. The name of the datafile must be specified in
data.h
in the variabledatafilename
. Datapoints must be written as binary strings of 0’s and 1’s encoded on at leastn
bits (with no spaces between the bits), wheren
is the number of spin variables specified indata.h
. The file must contain one datapoint per line — see example in the INPUT folder. - (optional) a basis input file (see section "Reading the basis from an input file” above).
All the output files will be stored in the output folder whose name is specified in data.h
.