Repository for my applied mathematician master thesis
In my master thesis I examined how central nodes of a temporal network can be predicted in advance. I used this repository for computing centrality measures on dynamic graphs. The code is based on the LEMON 1.3.1 C++ graph library.
In this framework dynamic graphs are partitioned into subgraphs related to time intervals of given length. Then the following centrality measures and graphs statistics can be compared as the graph evolves.
- Indegree
- Outdegree
- Negative beta measure
- PageRank
- SALSA
The sources of the centrality features can be found in the centrality folder.
You can clone the repository with the following command
git clone https://github.com/ferencberes/msc-thesis.git
Then change directory into the repository
cd msc-thesis
The project is developed under Unix environment. There are some dependencies that can be installed from a custom script, but the following requirements are needed if you want to use this repository.
- g++ 4.8.2 version or higher
- SCons 2.3.4 or higher
SCons compiles the C++ sources of the project.
There are several dependencies that are needed for this repository.
- LEMON 1.3.1 (C++ graph library)
- CMake 3.2.1 or higher (for compiling Lemon source code)
- log4cpp (C++ logging library)
They can be installed from a custom script with the following command.
./manage/install.sh
This script installs all formerly listed dependency into 'src/dep' folder.
Python scripts are mostly related to visualization and file formatting tasks. Thus, you can run the C++ code without these Python modules. However, if you want to use custom configuration files for the jobs then the following modules must be installed.
- matplotlib
- numpy
- json
- itertools
- math
- copy
- datetime
- subprocess
- jinja2
- pytest
Execution with custom configuration files is explained in the 'Execution using json config files' section.
After installing C++ dependencies the source code must be compiled and tested. In this project SCons is used for compiling. The repository contains some gtest that check whether the C++ sources are correct. With the execution of the following code you can compile the code and run the tests afterwards.
./manage/compile_and_test_all.sh
The Python scripts can be tested as well. To use this feature you must install pytest along with other modules listed in the 'Python dependencies' section. Then you can run the following command.
./src/scripts/run_pytest.sh
Here, I present how you can use the scripts in this repository if you want to run some tests on dynamic graphs.
In the examples, we will use the Enron graph as input from Koblenz Network Collection. This is a temporal directed graph related to an email network. You can find the compressed lemon graph format 'enron.lgf.zip' here.
The only compatible file format is lemon graph format(.lgf). It is file format supported by LEMON. For each edge the data must contain the source and target vertex, and the Unix epoch timestamps.
For example, here is a portion of the input:
@arcs
labels time
1000 966 0 1004411463
1000 966 1 1004411913
100 100 2 1010781654
100 10756 3 1010687491
...
There is a script that you can use to parse .tsv or .csv files into .lgf files. You only need to give the positions for source, target and timestamps in the original file. If you execute this command you will get a hint for parameter usage.
./src/scripts/parsers/timeline_tsv_to_lgf.py
For example, I converted the original Enron .csv input format to .lgf with a similar command.
./src/scripts/parsers/timeline_tsv_to_lgf.py ./enron.csv ./enron.lgf 0 1 3
After you have compiled the C++ sources successfully you can compute centrality measures on dynamic graphs. You have to decompress 'enron.lgf.zip' before running any of the following commands.
First execute a command which only loads the graph and extracts some general statistics.
src/scripts/experiment/runTemporalLemonTest.sh /home/your-user/git/msc-thesis/enron.lgf /home/your-user/git/msc-thesis/enron_test
The former command also gives a hint about the full variable setting.
Usage: <input_lgf> <output_folder> <start_time> <delta> <centrality_prev_interval_count> <combine_factor> <graph_stat_prev_interval_count> <enable_multi_edges> <max_num_of_steps> <topK> OPTIONS:(-beta, -degree, -pr, -salsa)
Here you have to set several parameters.
- input_lgf: absolute path for input graph in .lgf format
- output_folder: absolute path for the output directory
- start_time: the graph analysis will start from this Unix epoch timestamp
- delta_time: the length of the time intervals in seconds
- centrality_prev_interval_count: compute centrality with the edges of the last given number of intervals
- combine_factor: set the default -1 value.
- graph_stat_prev_interval_count: set the default -1 value
- enable_multi_edges: enable multiple edges (true/false)
- max_num_of_steps:
- topK: extract topK central nodes for each measure
- centrality measures: you can set which measures are computed
- degree: indegree and outdegree
- pr: PageRank (10 iteration, dampening factor is 0.85)
- salsa: SALSA authority and hub scores (10 iteration)
- beta: negative beta measure or "Markovian indegree" (For details see this article)
For Enron graph execute the following command.
./src/scripts/experiment/runTemporalLemonTest.sh /home/your-user/git/msc-thesis/enron.lgf /home/your-user/git/msc-thesis/enron_test 915148802 7884000 0 -1.0 -1 true 20 10 -degree -pr -salsa -beta
After the job ended there is a json file, called intervals.json, in the the output folder. It contains all information and results about the given job. The top_k central nodes for each centrality measure are included for all intervals as well.
Documentation for this section is coming soon...