AGE

Active Learning for Graph Embedding

Language: Python. License: MIT.

This program (AGE) implements a framework for active learning for graph embedding, as proposed in the paper below.
If you use it in scientific experiments, please cite the paper:
@article{DBLP:journals/corr/CaiZC17,
  author    = {HongYun Cai and
               Vincent Wenchen Zheng and
               Kevin Chen{-}Chuan Chang},
  title     = {Active Learning for Graph Embedding},
  journal   = {CoRR},
  volume    = {abs/1705.05085},
  year      = {2017},
  url       = {https://arxiv.org/abs/1705.05085},
  timestamp = {Mon, 15 May 2017 06:49:04 GMT}
}

The code has been tested on Ubuntu 16.04 LTS with an Intel Xeon(R) CPU E5-1620 @ 3.50GHz x 8 and 16 GB of memory.


============== *** Installation *** ============== 
python setup.py install

============== *** Requirements *** ============== 
tensorflow (>0.12)
networkx
Graph convolutional network (Kipf and Welling, ICLR 2017): https://github.com/tkipf/gcn

============== *** Data *** ==============
To use your own data, you have to provide:

an N by N adjacency matrix (N is the number of nodes),
an N by D feature matrix (D is the number of features per node), and
an N by E binary label matrix (E is the number of classes).
Have a look at the load_data() function in utils.py for an example.
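
As a rough illustration of the three inputs listed above, the following sketch builds them for a toy graph in plain Python. The helper name build_inputs and the toy edges, features, and labels are hypothetical and not part of this repository; the actual loader is load_data() in utils.py, which may well use sparse matrices rather than nested lists.

```python
def build_inputs(num_nodes, edges, features, labels, num_classes):
    """Build toy adjacency, feature, and label matrices (hypothetical helper)."""
    # N by N symmetric adjacency matrix for an undirected graph.
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1

    # N by D feature matrix: every node must have the same number of features.
    num_features = len(features[0])
    assert len(features) == num_nodes
    assert all(len(row) == num_features for row in features)

    # N by E binary label matrix: one row per node, a single 1 in its class column.
    label_matrix = [[0] * num_classes for _ in range(num_nodes)]
    for node, cls in enumerate(labels):
        label_matrix[node][cls] = 1

    return adj, features, label_matrix


# Toy data: N = 4 nodes, 3 undirected edges, D = 2 features, E = 2 classes.
edges = [(0, 1), (1, 2), (2, 3)]
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
labels = [0, 0, 1, 1]
adj, feat, y = build_inputs(4, edges, features, labels, num_classes=2)
```

Nested lists keep the sketch free of dependencies; for real datasets a sparse representation is usually the more practical choice.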

In this example, we load citation network data (Cora, Citeseer or Pubmed). The original datasets can be found here: http://linqs.cs.umd.edu/projects/projects/lbc/. In our version (see data folder) we use dataset splits provided by https://github.com/kimiyoung/planetoid (Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov, Revisiting Semi-Supervised Learning with Graph Embeddings, ICML 2016) to load the whole dataset, and use the same test data as theirs.

The validation nodes are randomly sampled from the set of non-test nodes. We randomly generate 10 validation sets for each dataset; the node indexes are stored in "source/datasetname/val_idxa.txt" (where a is the validation set id, in the range [0,10]).

The initially labeled nodes are randomly sampled from the set of non-test and non-train nodes. Given C (the number of classes in the dataset) and a predefined L, AGE randomly samples L nodes from each class as the initially labeled nodes (so there are C*L initially labeled nodes in total).
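
The per-class sampling described above can be sketched as follows. This is a minimal illustration, not the code used by AGE: the helper name sample_initial_labels, its arguments, and the toy class assignments are assumptions, and the caller is expected to pass in the candidate pool with the excluded nodes already removed.

```python
import random
from collections import defaultdict


def sample_initial_labels(node_classes, candidates, per_class, seed=None):
    """Pick `per_class` nodes from each class among `candidates` (hypothetical helper)."""
    rng = random.Random(seed)

    # Group the candidate nodes by their class.
    by_class = defaultdict(list)
    for node in candidates:
        by_class[node_classes[node]].append(node)

    selected = []
    for cls in sorted(by_class):
        pool = by_class[cls]
        if len(pool) < per_class:
            raise ValueError(
                "class %d has only %d candidate nodes" % (cls, len(pool))
            )
        # Sample without replacement inside each class.
        selected.extend(rng.sample(pool, per_class))
    return sorted(selected)


# Toy example: C = 3 classes, L = 2, so C*L = 6 initially labeled nodes.
node_classes = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
candidates = range(10)
initial = sample_initial_labels(node_classes, candidates, per_class=2, seed=0)
```

Passing a seed makes the selection reproducible across runs, which helps when comparing results over several validation sets.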

============== *** Run the Program *** ==============
1. First generate the graph centrality score for each node as follows.
Command: 
python get_graph_centrality.py datasetname 
e.g., python get_graph_centrality.py citeseer
Parameters:
datasetname: the name of the dataset to process
Output:
The centrality scores for each node (in the same order as in the graph) are stored in "res/datasetname/graphcentrality/normcen"
Note:
We adopt PageRank centrality in this work. You can try other centrality measures by modifying the function "centralissimo()" in the file "get_graph_centrality.py".
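
For readers unfamiliar with PageRank centrality, here is a minimal power-iteration sketch in plain Python. It is not the code in "get_graph_centrality.py"; the function name pagerank, the damping factor of 0.85, the stopping tolerance, and the toy graph are all assumptions made for illustration, and the sketch does not reproduce whatever normalization the script applies before writing "normcen".

```python
def pagerank(adj_list, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a graph given as {node: [neighbors]}."""
    nodes = sorted(adj_list)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}

    for _ in range(max_iter):
        # Score held by nodes without neighbors is spread uniformly.
        dangling = sum(score[v] for v in nodes if not adj_list[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}

        # Every other node splits its damped score evenly among its neighbors.
        for v in nodes:
            if adj_list[v]:
                share = damping * score[v] / len(adj_list[v])
                for u in adj_list[v]:
                    new[u] += share

        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < tol:
            break
    return score


# Toy star graph: node 0 is connected to nodes 1, 2 and 3.
graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
scores = pagerank(graph)
```

The scores sum to 1, and the hub of the star receives the highest score, which is the behavior a centrality measure is expected to show on this graph.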
2. Run the AGE algorithm to actively select nodes to label during graph embedding, and record the MacroF1 and MicroF1 scores for node classification.
Command:
python train_entropy_density_graphcentral_ts.py validation_id nb_initial_labelled_nodes_per_class class_nb datasetname
e.g., python train_entropy_density_graphcentral_ts.py 0 4 6 citeseer
Parameters:
validation_id: the validation set id, referring to the id a in "source/datasetname/val_idxa.txt"
nb_initial_labelled_nodes_per_class: the number of initially labelled nodes per class; we use four in this work
class_nb: the number of classes in the dataset
datasetname: the name of the dataset to process
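
The two metrics recorded in step 2 can be computed as sketched below. This follows the standard definitions of macro- and micro-averaged F1 for single-label classification; how the training script computes them internally is not shown here, and the function name f1_scores and the toy labels are hypothetical.

```python
def f1_scores(y_true, y_pred, num_classes):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    tp = [0] * num_classes  # true positives per class
    fp = [0] * num_classes  # false positives per class
    fn = [0] * num_classes  # false negatives per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2.0 * tp_c / denom if denom else 0.0

    # Macro-F1: unweighted mean of the per-class F1 scores.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(num_classes)) / num_classes
    # Micro-F1: F1 computed from the counts pooled over all classes.
    micro = f1(sum(tp), sum(fp), sum(fn))
    return macro, micro


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro_f1, micro_f1 = f1_scores(y_true, y_pred, num_classes=3)
```

With exactly one label per node, Micro-F1 equals plain accuracy, while Macro-F1 gives every class the same weight regardless of its size.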