/OAG-taxo

Primary LanguagePythonMIT LicenseMIT

OAG-taxo

Introduction

This project aims to implement several methods for taxonomy expansion. We provide three methods: Bilinear, TaxoExpan, and TaxoEnrich. Also, we provide inference on AI taxonomy via pre-trained models on Computer Science taxonomy.

This work is mainly based on the work of TaxoEnrich. We choose it because it has many available models and trainers.

Environment

You need to prepare an environment of cuda10 + dgl0.4.0. It can be only used on the Graphics Card below 30.

Install requirements.txt via pip install -r requirements.txt (test with Python 3.7)

Run the following command before running any methods.

export PYTHONPATH="`pwd`:$PYTHONPATH"

Data Preparation

If you want to try the dataset of Mag-CS, Mag-full, and our own dataset on these models, we have prepared the dataset here. You can put MAG-CS/MAG-full/OAG-AI folder in the data directory in the project root directory.

If you want to try other datasets, you can follow the methods mentioned in Taxoenrich. In short, you need to prepare the x.terms file, x.taxo file. Next, run the embedding_generation.py and generate_dataset_binary.py, then you can get the x.bin file for training.

For example,

python data_creation/embedding_generation.py --dataset oag-ai
python data_creation/generate_dataset_binary.py -d data/OAG_AI -t "Artificial Intelligence" -p 0

Train the model

try: python train.py --config config-file, in which config-file has been prepared in config_files folder. Run the enrich model with config.test.enrich.json. Run the config.test.baseline.json for Billiear Model. Run the config.test.baselineextmn.json for TaxoExpan Model. For example,

python train.py -c config_files/MAG-CS/config.test.enrich.json

Infer the model

We provide inference methods for the Artificial Intelligence dataset from pre-trained models on Computer Science taxonomy.

python inner_infer.py --resume your_model_path_here --config config_files/MAG-CS/config.test.enrich.json

We provide our pre-trained models for you here

Config File Prepared

For example, in ./config_file/mag_cs, we introduce each file's usage:

config.test.enrich.json: TaxoEnrich method on completion task config.test.baseline.json: baseline Bilinear method on completion task config.test.baselineex.json: TaxoExpan method on expansion task config.test.baselineextmn.json: TaxoExpan method on completion task

config.valid.X.json means the corresponding infer config file for the X method and config.test.X.json

If you do not have enough GPU memory for training, decrease the batch size and the number of negative samples.

Report PDF

Share my report here.

References

[1] Jiaming Shen, Zhihong Shen, Chenyan Xiong, Chi Wang, Kuansan Wang and Jiawei Han ”TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network”, in Proc. 2020 Int. World Wide Web Conf. (WWW’20), Taipei, Taiwan, Apr. 2020.

[2] Minhao Jiang, Xiangchen Song, Jieyu Zhang and Jiawei Han, “TaxoEnrich: Self-Supervised Taxonomy Completion via Structure-Semantic Representations”, in Proc. The ACM Web Conf. 2022 (WWW’22), April 2022