C3KG

Introduction

Existing commonsense knowledge bases often organize tuples in isolation, which makes it hard for commonsense conversational models to plan their next steps. To fill this gap, we curate a large-scale, multi-turn, human-written conversation corpus and create the first Chinese commonsense conversation knowledge graph, which incorporates both social commonsense knowledge and dialog flow information. To show the potential of our graph, we develop a graph-conversation matching approach and benchmark two graph-grounded conversational tasks.

The paper "C3KG: A Chinese Commonsense Conversation Knowledge Graph" has been accepted to the Findings of the 60th Annual Meeting of the Association for Computational Linguistics (Findings of ACL 2022). For details, see https://aclanthology.org/2022.findings-acl.107/

If you use our code or your research is related to our paper, please cite our paper:

@inproceedings{li2022c3kg,
  title={C3KG: A Chinese Commonsense Conversation Knowledge Graph},
  author={Li, Dawei and Li, Yanran and Zhang, Jiayi and Li, Ke and Wei, Chen and Cui, Jianwei and Wang, Bin},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
  pages={1369--1383},
  year={2022}
}

Released Resources

All of our released resources are available here, including C3KG, ATOMIC_ZH, and the CConv dataset.

Quick Start

Data and Models Preparation

  • Download the ATOMIC2020 dataset and put all three data files (train.tsv, test.tsv, dev.tsv) into ./data:
wget https://ai2-atomic.s3-us-west-2.amazonaws.com/data/atomic2020_data-feb2021.zip
unzip atomic2020_data-feb2021.zip
cd atomic2020_data-feb2021
cp train.tsv ../data/
cp test.tsv ../data/
cp dev.tsv ../data/
cd ..
  • Download the LTP4 toolkit (here we use the Base2 model). Create ./model and put the Base2 model into it:
wget http://39.96.43.154/ltp/v3/base2.tgz
tar -xzvf base2.tgz
mkdir model
mv Base2 ./model/
  • Download our SBERT-ATOMIC semantic similarity model here and put it into ./model.
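
Before moving on, it can help to confirm that the files from the steps above ended up where the scripts expect them. The following is a minimal sketch, not part of the released code: the helper name `missing_paths` and the `EXPECTED` list are hypothetical, and the paths are simply read off the commands above. The SBERT-ATOMIC model is not checked because its file name is not stated here.

```python
from pathlib import Path

# Paths read off the preparation steps above (an assumption of this sketch).
# The SBERT-ATOMIC model is omitted because its file name is not given here.
EXPECTED = [
    "data/train.tsv",
    "data/test.tsv",
    "data/dev.tsv",
    "model/Base2",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files are in place.")
```

Run it from the repository root; any path it reports as missing points back to the corresponding download step.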

Data Preprocessing

  • Rewrite the request_dev() function in ./preprocess/get_trans.py using any translation model or API:
def request_dev(query):
    # rewrite using any translation model or API
    raise NotImplementedError("rewrite using any translation model or API")
  • After that, run preprocess.sh:
chmod +x preprocess.sh
./preprocess.sh
  • Alternatively, you can directly use the translated ATOMIC_Chinese.tsv, head_shortSentence.csv, and head_phrase.csv here.
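
If you do write your own request_dev(), one possible shape is sketched below. It wraps an arbitrary translation callable with caching and simple retries, since a remote API may fail intermittently and the same text may be requested more than once. This is only an illustration under stated assumptions: `make_request_dev` and its `backend`, `retries`, and `delay` parameters are hypothetical names, and the sketch assumes request_dev() takes a string and returns the translated string, which you should check against ./preprocess/get_trans.py.

```python
import time
from functools import lru_cache


def make_request_dev(backend, retries=3, delay=1.0):
    """Build a request_dev(query) function around a translation callable.

    `backend` is any function mapping a source string to its translation
    (hypothetical; supply your own model or API call). Successful results
    are cached, and failed calls are retried up to `retries` times.
    """

    @lru_cache(maxsize=None)
    def request_dev(query):
        last_error = None
        for attempt in range(retries):
            try:
                return backend(query)
            except Exception as err:  # broad on purpose: backends vary
                last_error = err
                if attempt < retries - 1:
                    time.sleep(delay)
        raise RuntimeError(
            f"translation failed after {retries} attempts"
        ) from last_error

    return request_dev


# Toy backend for demonstration only; replace with a real translator.
request_dev = make_request_dev(lambda text: "[zh] " + text, delay=0.0)
```

Because only successful calls are cached, a query that fails on every attempt will be tried again the next time it is requested.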

C3KG Construction

  • To get C3KG, run construct.sh. Note that we put the CConv dataset here:
chmod +x construct.sh
./construct.sh

License

  • Our dataset is licensed under CC BY 4.0, and our code is licensed under the Apache License 2.0.