Code and data for paper AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks

AspEm

This repository provides code and data for the paper:

AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks
Yu Shi, Huan Gui, Qi Zhu, Lance Kaplan, and Jiawei Han.
In Proceedings of the 2018 SIAM International Conference on Data Mining, SIAM, 2018.

In particular, it includes (1) a reference implementation of the incompatibility measure, (2) ad-hoc implementations of the single-aspect embedding algorithm for the datasets used in the paper, (3) the IMDb dataset (the full DBLP dataset is excluded from this repository due to its file size), and (4) the class labels used in the DBLP classification tasks.

Basic Usage

Input

  1. The input HIN file should list every edge of the input HIN, one edge per line, in the format

     node_1 node_2 edge_weight edge_type
    

    Note that node_1 and node_2 should be in the form

     node_type:node_name
    

    An example input HIN file can be found at data/imdb/imdb.hin.

  2. Additionally, to run the ad-hoc implementation of the single-aspect embedding algorithm for star-schema datasets, one should also provide a file listing all center nodes (e.g., data/imdb/movie.node) and a file listing all attribute nodes (e.g., data/imdb/uadg.node).
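
The edge format above can be checked with a few lines of Python before running the scripts. The following is only a minimal sketch, not part of the repository: it assumes fields are separated by whitespace, that node names contain no spaces, and that the node type is everything before the first colon. The names Edge, split_node, and parse_edge, and the sample line, are made up for illustration.

```python
from typing import NamedTuple, Tuple


class Edge(NamedTuple):
    """One parsed line of an HIN edge file (hypothetical helper type)."""
    src_type: str
    src_name: str
    dst_type: str
    dst_name: str
    weight: float
    edge_type: str


def split_node(token: str) -> Tuple[str, str]:
    """Split 'node_type:node_name' at the first colon."""
    node_type, sep, node_name = token.partition(":")
    if not sep or not node_type or not node_name:
        raise ValueError("expected node_type:node_name, got %r" % token)
    return node_type, node_name


def parse_edge(line: str) -> Edge:
    """Parse 'node_1 node_2 edge_weight edge_type' (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d: %r" % (len(fields), line))
    src_type, src_name = split_node(fields[0])
    dst_type, dst_name = split_node(fields[1])
    return Edge(src_type, src_name, dst_type, dst_name, float(fields[2]), fields[3])


# Made-up sample line; the real type prefixes and edge-type labels are
# whatever data/imdb/imdb.hin actually uses.
edge = parse_edge("m:movie_1 d:director_7 1.0 md")
print(edge.src_type, edge.dst_name, edge.weight)
```

Raising on malformed lines, rather than skipping them, makes formatting problems visible before a long embedding run.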

Execute

To measure the incompatibility of all base aspects in an HIN:

$ python src/calc_base_aspect_inconsistency.py --input $input-hin-file --output $base-aspect-inc-file [optional: --sample-rate $sample-rate] 

To aggregate the incompatibility of all base aspects from the result of the previous step:

$ python src/agg_aspect_inconsistency.py $base-aspect-inc-file 

As an example, to calculate the incompatibility of each aspect of the IMDb dataset, execute the following commands sequentially:

$ python src/calc_base_aspect_inconsistency.py --input data/imdb/imdb.hin --output data/imdb/imdb_base_aspect_inc.csv 
$ python src/agg_aspect_inconsistency.py data/imdb/imdb_base_aspect_inc.csv 

To execute the ad-hoc implementation of the embedding algorithm, one should run make in the corresponding source code directory in src/, and then execute the binary in its bin/ directory. The argument -types specifies the attribute node types involved in the current aspect, with the following mapping: in IMDb, u for user, a for actor, d for director, g for genre; in DBLP, a for author, p for reference, v for venue, w for term, y for year.

As an example, to embed the IMDb network with only attribute node types user and director, execute the following commands sequentially:

$ cd src/emb_imdb/; make; cd ../..
$ ./src/emb_imdb/bin/emb_imdb -types ud -hin data/imdb/imdb.hin -center data/imdb/movie.node -attribute data/imdb/uadg.node -output data/imdb/attribute.emb -output-center data/imdb/center.em
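
The letter codes accepted by -types, as listed above, can be kept in a small lookup table to catch typos before launching the binary. This is a hedged sketch for illustration only; the table ATTRIBUTE_TYPES and the function decode_types are hypothetical names and do not exist in the repository.

```python
# Letter codes for attribute node types, copied from the mapping stated in
# this README. The dictionary and function names are illustrative only.
ATTRIBUTE_TYPES = {
    "imdb": {"u": "user", "a": "actor", "d": "director", "g": "genre"},
    "dblp": {"a": "author", "p": "reference", "v": "venue", "w": "term", "y": "year"},
}


def decode_types(dataset, types):
    """Translate a -types string such as 'ud' into attribute node type names."""
    mapping = ATTRIBUTE_TYPES[dataset]
    unknown = [code for code in types if code not in mapping]
    if unknown:
        raise ValueError("unknown type code(s) for %s: %s" % (dataset, "".join(unknown)))
    return [mapping[code] for code in types]


print(decode_types("imdb", "ud"))  # ['user', 'director']
```

Note that the same letter can mean different things per dataset (a is actor in IMDb but author in DBLP), which is why the table is keyed by dataset first.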

Class Labels for DBLP Classification

In the DBLP experiment of the paper, two classification tasks were conducted based on the two class label files in

data/class_label/

Citing

If you find AspEm useful for your research, please consider citing the following paper:

@inproceedings{shi2018aspem,
 author = {Shi, Yu and Gui, Huan and Zhu, Qi and Kaplan, Lance and Han, Jiawei},
 title = {AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks},
 booktitle = {Proceedings of the 2018 SIAM International Conference on Data Mining},
 year = {2018},
 organization = {SIAM}
}

Miscellaneous

Please send any questions you might have about the code and/or the algorithm to yushi2@illinois.edu.

Note: This is only a reference implementation of the AspEm algorithm. As discussed in the paper, AspEm is a flexible framework and one can choose their favorite network embedding algorithm to embed every single aspect.
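
To try a different embedding algorithm on a single aspect, one possible first step is to extract that aspect's edges from the HIN file and pass them to the algorithm of choice. The sketch below is an assumption-laden illustration, not the repository's method: it treats an aspect simply as a set of node types, keeps an edge only when both endpoints have a type in that set, and silently skips lines that do not have four fields. The function names and the sample lines (including the single-letter type prefixes) are hypothetical.

```python
from typing import Iterable, Iterator, Set


def node_type(token: str) -> str:
    """Return the part of 'node_type:node_name' before the first colon."""
    return token.partition(":")[0]


def edges_in_aspect(lines: Iterable[str], aspect_types: Set[str]) -> Iterator[str]:
    """Yield edge lines whose two endpoints both have a type in aspect_types."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: blank or malformed lines are ignored
        if node_type(fields[0]) in aspect_types and node_type(fields[1]) in aspect_types:
            yield line.rstrip("\n")


# Made-up edges; real files use whatever type prefixes the dataset defines.
sample = [
    "m:movie_1 u:user_3 4.0 mu\n",
    "m:movie_1 d:director_7 1.0 md\n",
    "m:movie_2 g:genre_5 1.0 mg\n",
]
for kept in edges_in_aspect(sample, {"m", "u", "d"}):
    print(kept)
```

In practice the input would come from an open file handle, for example edges_in_aspect(open(path), types), and the filtered lines would be written out in whatever format the chosen embedding tool expects.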