LINE: Large-scale information network embedding

##Introduction

This is the LINE toolkit, developed for embedding very large-scale information networks. It is suitable for a variety of networks, including those with directed, undirected, binary, or weighted edges. The LINE model is efficient: it can embed a network with millions of vertices and billions of edges on a single machine within a few hours.

Contact: Jian Tang, tangjianpku@gmail.com
Project page: https://sites.google.com/site/pkujiantang/line
This work was done while the author was working at Microsoft Research.

##Usage

We provide both Windows and Linux versions. Compiling the source code requires an external package, which is used to generate random numbers for the edge-sampling algorithm in the LINE model. The Windows version uses the BOOST package, which can be downloaded at http://www.boost.org/; the Linux version uses the GSL package, which can be downloaded at http://www.gnu.org/software/gsl/
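To illustrate why a random number generator is needed, here is a minimal sketch of weighted edge sampling, in which each edge is drawn with probability proportional to its weight. It uses `std::mt19937` and `std::discrete_distribution` from the C++ standard library as stand-ins; the toolkit itself relies on BOOST or GSL, and its actual sampling code may differ. The class name `EdgeSampler` is hypothetical and not part of LINE.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of the LINE toolkit): draws edge indices
// with probability proportional to the corresponding edge weight.
class EdgeSampler {
public:
    // weights[i] is the weight of edge i; seed makes the draws reproducible.
    EdgeSampler(const std::vector<double>& weights, unsigned seed)
        : rng_(seed), dist_(weights.begin(), weights.end()) {}

    // Returns the index of one sampled edge.
    std::size_t sample() { return dist_(rng_); }

private:
    std::mt19937 rng_;                              // pseudo-random generator
    std::discrete_distribution<std::size_t> dist_;  // weight-proportional draws
};
```

With the example weights 3, 1, and 4, the third edge would be drawn about half of the time, since 4 / (3 + 1 + 4) = 0.5.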

Network Input

The input of a network consists of the edges in the network. Each line of the input file represents a DIRECTED edge, specified in the format "source_node target_node weight" (fields can be separated by either a space or a tab). Each undirected edge must be represented by TWO DIRECTED edges, one in each direction. Here is an input example of a word co-occurrence network:

good the 3
the good 3
good bad 1
bad good 1
bad of 4
of bad 4
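If your data is stored as undirected edges, the conversion to this format can be sketched as follows. This is a minimal illustration, not part of the toolkit: the `UndirectedEdge` struct and both function names are hypothetical, and fields are separated by a single space.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper type (not part of LINE): one undirected, weighted edge.
struct UndirectedEdge {
    std::string u;
    std::string v;
    double weight;
};

// Writes each undirected edge as two directed lines in the
// "source_node target_node weight" format, one line per direction.
void write_as_directed(const std::vector<UndirectedEdge>& edges, std::ostream& out) {
    for (const auto& e : edges) {
        out << e.u << ' ' << e.v << ' ' << e.weight << '\n';
        out << e.v << ' ' << e.u << ' ' << e.weight << '\n';
    }
}

// Convenience wrapper that returns the formatted text as a string.
std::string to_directed_text(const std::vector<UndirectedEdge>& edges) {
    std::ostringstream out;
    write_as_directed(edges, out);
    return out.str();
}
```

Passing a `std::ofstream` to `write_as_directed` would produce a file that can be given to the `-train` option.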

Run

./line -train network_file -output embedding_file -binary 1 -size 200 -order 2 -negative 5 -samples 100 -rho 0.025 -threads 20
  • -train, the input file of a network;
  • -output, the output file of the embedding;
  • -binary, whether to save the output file in binary mode; the default is 0 (off);
  • -size, the dimension of the embedding; the default is 100;
  • -order, the order of the proximity used; 1 for first order, 2 for second order; the default is 2;
  • -negative, the number of negative samples used in negative sampling; the default is 5;
  • -samples, the total number of training samples, in millions (for example, -samples 100 means 100 million samples);
  • -rho, the starting value of the learning rate; the default is 0.025;
  • -threads, the total number of threads used; the default is 1.

Files in the folder

  • line.cpp, the source code of LINE;
  • reconstruct.cpp, the code for reconstructing sparse networks into dense ones, as described in Section 4.3 of the paper;
  • normalize.cpp, the code for normalizing the embeddings (L2 normalization);
  • concatenate.cpp, the code for concatenating the 1st-order and 2nd-order embeddings.
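The two post-processing steps named above can be sketched for a single vertex as follows. This is a minimal illustration under assumptions, not the code in normalize.cpp or concatenate.cpp: the function names are hypothetical, embeddings are held in memory as `std::vector<float>`, and no file I/O is shown.

```cpp
#include <cmath>
#include <vector>

// Scales v in place to unit L2 norm; an all-zero vector is left unchanged.
void l2_normalize(std::vector<float>& v) {
    double sum_sq = 0.0;
    for (float x : v) sum_sq += static_cast<double>(x) * x;
    if (sum_sq == 0.0) return;  // avoid dividing by zero
    const double inv_norm = 1.0 / std::sqrt(sum_sq);
    for (float& x : v) x = static_cast<float>(x * inv_norm);
}

// Returns the 1st-order embedding followed by the 2nd-order embedding.
std::vector<float> concatenate(const std::vector<float>& first_order,
                               const std::vector<float>& second_order) {
    std::vector<float> joined;
    joined.reserve(first_order.size() + second_order.size());
    joined.insert(joined.end(), first_order.begin(), first_order.end());
    joined.insert(joined.end(), second_order.begin(), second_order.end());
    return joined;
}
```

For example, normalizing the vector (3, 4) gives (0.6, 0.8), and concatenating two 100-dimensional embeddings gives one 200-dimensional embedding.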

Examples

An example of running LINE on the YouTube data set (available at http://socialnetworks.mpi-sws.mpg.de/data/youtube-links.txt.gz) is provided in the files train_youtube.bat and train_youtube.sh.

##Citation

@inproceedings{tang2015line,
  title={LINE: Large-scale Information Network Embedding.},
  author={Tang, Jian and Qu, Meng and Wang, Mingzhe and Zhang, Ming and Yan, Jun and Mei, Qiaozhu},
  booktitle={WWW},
  year={2015},
  organization={ACM}
}