This repo contains the code intended to support the developing process of my Master Degree Thesis. The same results are used in the article Zambra, Maritan, Testolin, "Emergence of network motifs in deep neural networks", Entropy, 2020, 22, 204, DOI: https://doi.org/10.3390/e22020204.
Motivated by recent publications concerning the development and formalisation of a theoretical framework in which Deep Learning may fit into, this thesis topic and work is finalised to investigate the evolution of the topology of a simple feedforward artificial neural network during the process of learning.
Network Science is thought to be an interesting standpoint. Inspired by Milo et al. (2002), a simple multilayer-perceptron model is inspected as breeding ground for the emergence of network motifs, once the model is trained on different sets of synthetic data and with different initialisation schemes (Saxe et al., 2013, Glorot and Bengio, 2010). The question addressed is whether one can appreciate the trace that the learning dynamics leaves in the form of functional modules. The neural network inspected is trained on a multi-label classification task using two different data sets. Another pivotal aspect is the influence of the initialization scheme in relationship with training efficiency. Turns out that non-trivial schemes, such as Orthogonal matrices and Glorot-Bengio schemes, could impress a sharper initial fingerprint to the initial motifs landscape, thus encouraging these motifs to develop.
An endearing biological resemblance with kinases dynamics in transduction networks may be appreciated (see Alon, 2006). Moreover, signature of diversity can be appreciated for different arrangments of data sets and initialisation schemes.
srcs
source codes. The main.py
scripts launches what necessary
DataSets
folder, which contains the scripts to generate the two data sets used. Number of data examples and other details are hard-coded. In addition, serialized files of previously created data sets are present, in order to fetch them in the simulations.utils
contains the modules I have written to accomplish the task. Other than Machine Learning, as Keras, and standard scientific computing libraries, NumPy, Pandas and so forth, all the code is written from scratch, from the generation of synthetic data to the utilities to read the results of the FANMOD motifs mining program (see below) and results production.Results
. Here the directory relative to one seed is created. Inseed_nseed
the directorysynds
is created if the user wants to perform the analyses on the sets of synthetic data. Otherwise, by settingflow['mnist'] = True
, the analyses are carried out for the MNIST data set. Thus the directorymnsit
is created in place ofsynds
. The subsequent hierarchy is the same, however. Insynds/
the results created for each initialisation scheme are stored in proper directories, alongside with the directories that contain graphical results and serialized objects. In each directory containing the results for the initialisation schemes, results relative to the data set are stored, that is, the.h5
model from Keras and the.txt
graph data structures used by the FANMOD program.
This hierarchy is automatically created by a program run, and files are saved in a consistent fashion, in such a way to fetch the needed files in the same run or later. Note that in the main.py
script the execution is modular: The user instructs the program whether to repeat the any of the stages in the computational workflow. Accordingly, files are created again or fetched.
In the main.py
file, the flow
dictionary is declared, and then passed as unpackable object. Herein, instructions about the program flow are contained. That is, these instruct the program whether to perform certain chunks of execution, e.g. model initialization, training, preprocess and so forth.
The FANMOD software has the advantages to implement wighted (colored) network motifs mining and has a graphical user interface. It is not possible to call it from command line, hence to integrate it in the workflow of the code contained in this repository. The user must tune the workflow instructions in the aforementioned dictionaries in such a way to create the graph files amenable for FANMOD in a first run, run FANMOD so that this latter can generate the needed output files and in a second time the code should be run again, executing the postprocess stage.
In the analyses presented FANMOD has been used setting the number of random networks to compare the actual model with to be 100. Check the CSV
option in the Output file
menu. Check Directed
(default choice) in the Input file
menu. Browse the text file produced by the code run and set the name of the output file to env_weight_size_out.csv
. env
should be replaced by init
, tree
, clus
, respectively the initial configuration, binary tree and independent clusters data sets. Replace weight
with u
or w
, respectively unweighted or weighed network. In this second case check the Colored edges
option. Replace size
with s4
or s5
, respectively four or five vertices motifs. As an example, the input file for the case of binary tree data set accounting for a weighted graph will be tree_w_Graph.txt
and it is created automatically by the program. Assume that one wants to mine four nodes motifs in this case. Then the user should call the output file (and put it in same directory) tree_w_s4_out.csv
.
Note that the directory where this input file is stored is ..\srcs\Results\seed_n\synds\init_scheme\dataset\
. Here there are the .h5
Keras model export, the FANMOD input file and the output file generated running this latter. In the postprocessing stage, the program is thought to fetch these output files from this directory.
The OO-paradigm has been exploited but not abused. For the purposes of many of the functions written, it has been realized that simple function definition were the best trade-off between code readability, architecture clarity and modularity. The definition and usage of reduntant object types in this case would have resolved in an over-complex classes landscape, thus giving problems in an hypothetic further re-use.
It would be interesting to do a comparison with real-world data sets related results. The code has been modified to implement in a painless way the training on thte MNIST data set. However, the major difficulty is the fact that the networks used for the artificial data sets are inevitably different from the net that would absorb the MNIST data samples and classify them, due to the different input and output dimensions.
However, another improvement front related to this necessity would be the scalability of the approach presented. Scalability is the Achille's heel of the whole work. Larger graphs require a larger resources deployment. One interesting direction for further research is indeed devoted to this scalability issue.
The FANMOD program has been used, in that it has the capability of inspecting colored networks, as it is the case. See the linked page for relevant papers, executable, sources and license.
Published in Entropy