CoRel: Seed-Guided Topical Taxonomy Construction by Concept Learning and Relation Transferring

The source code used for our KDD'2020 paper.

Requirements

GCC compiler (used to compile the source c file): See the guide for installing GCC.

Datasets

Due to the constraint of size, we provide the link of our datasets in the following links, please copy the files to ${dataset}/.

DBLP & Yelp.

Run the Code

Preprocessing

If you are using our own datasets, you can skip this step. Otherwise, please first use this preprocessing tool to extract a sentences.json for your own corpus, and then run

bash preprocess.sh

to generate the index files.

Concept Learning for topic nodes

cd c
bash run_emb_part_tax.sh

This step compiles the source file and trains embedding for concept learning. The --topic_file in the script is used to specify the seed taxonomy. As an example, you can set ${topic_file} to be topics_field.txt for dataset dblp and topics_des.txt for dataset yelp. These topic files are already provided in the datasets. If you want to specify your own seed taxonomy, just feel free to create a new file using the format topics_{xxx}.txt. Each line starts with a parent node (with the root node being ROOT), and then followed by a tab. The children nodes of this parent is appended and separated by space. Generated embedding file is stored under ${dataset}.

Relation Transferring

A jupyter notebook version is available for this step. You can change the dataset and the topic_file name in main.ipynb. As another option, you can use the following python programme to generate the results.

cd ..
python main.py --dataset ${dataset} --topic_file ${topic_file}.txt

This step completes the taxonomy structure and outputs keywords for each node in the taxonomy. As an example, for DBLP dataset, you can run

python main.py --dataset dblp --topic_file topics_field.txt

Concept learning for all nodes

python generate_bash.py

This command generates a script from c/template.sh that can recursively run embedding training for all topics and subtopics.

cd c
bash run_emb_full_tax.sh

This command will generate the final topical taxonomy under a result directory.

Results

Results for each topics are generated at result\${dataset}\${topic}\subtopics_for_${topic}.txt. E.g., each line in result\DBLP\data_mining\subtopics_for_data_mining.txt is one subtopic of data mining (including a cluster of words).

teapot123/CoRel