Code and data for the COLING 2020 paper "Try to Substitute: An Unsupervised Chinese Word Sense Disambiguation Method Based on HowNet". [Paper]
Please cite our paper if you find it helpful.
```bibtex
@inproceedings{hou-etal-2020-try,
    title = "Try to Substitute: An Unsupervised {C}hinese Word Sense Disambiguation Method Based on {H}ow{N}et",
    author = "Hou, Bairu and Qi, Fanchao and Zang, Yuan and Zhang, Xurui and Liu, Zhiyuan and Sun, Maosong",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    year = "2020",
}
```
This repository is mainly contributed by Bairu Hou, Fanchao Qi and Yuan Zang. To run our WSD model or use the WSD dataset, please refer to the following instructions.
- torch==1.3.1
- torchvision==0.4.2
- transformers==2.8.0
- OpenHowNet==0.0.1a8
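You can install the dependencies with pip, for example:

```bash
pip install torch==1.3.1 torchvision==0.4.2 transformers==2.8.0 OpenHowNet==0.0.1a8
```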
Make sure you have run the following code to complete the installation of `OpenHowNet` before running any code in this repo.
```python
import OpenHowNet
OpenHowNet.download()  # downloads the HowNet data files used by OpenHowNet
```
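To verify that the HowNet data was downloaded successfully, you can try a simple sememe lookup. This is a sketch based on the OpenHowNet API around version 0.0.1a8; the exact method names may differ in other versions:

```python
import OpenHowNet

# Quick sanity check via a sememe lookup (hypothetical usage; see the OpenHowNet docs)
hownet_dict = OpenHowNet.HowNetDict()
print(hownet_dict.get_sememes_by_word('苹果'))  # sememes for "apple"
```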
- Build the vocabulary table either from a corpus or an external knowledge base like HowNet. We use the People's Daily corpus, which is already in the `data` directory. To tokenize the corpus and get the necessary file, use the following commands:
```bash
mkdir aux_files
python data_util.py
```
- Other corpora will also work; you only need to modify the `tokenize_corpus` function in the `data_util.py` file to process corpus files in a different format, as in the sketch below.
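As a rough illustration, here is a minimal sketch of such a modification, assuming a plain-text corpus with one pre-segmented sentence per line and whitespace-separated tokens; the actual signature of `tokenize_corpus` in `data_util.py` may differ:

```python
# Hypothetical sketch: adapt tokenize_corpus to a corpus with one
# pre-segmented sentence per line. The real signature in data_util.py may differ.
def tokenize_corpus(corpus_path, output_path):
    with open(corpus_path, 'r', encoding='utf-8') as fin, \
         open(output_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            tokens = line.strip().split()  # change this line for other formats
            if tokens:
                fout.write(' '.join(tokens) + '\n')
```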
The currently available HowNet-based Chinese WSD dataset is based on an outdated version of HowNet that cannot be found now. For evaluation and further academic use, we build a new and larger HowNet-based Chinese WSD dataset based on the Chinese Word Sense Annotated Corpus used in SemEval-2007 Task 5.
You can load the dataset with either `eval` or the `json` module in Python. With `eval`:
```python
dataset = []
with open("data/dataset.txt", 'r', encoding='utf-8') as f:
    for line in f:
        sample = eval(line.strip())  # each line is a Python dict literal
        dataset.append(sample)
```
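Note that `eval` executes arbitrary Python; if you prefer a safer parser for these dict literals, `ast.literal_eval` from the standard library is a drop-in replacement here:

```python
import ast
sample = ast.literal_eval(line.strip())  # parses Python literals only; never executes code
```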
Each line in the file will be transformed into a `dict` in Python, with the following keys:

- `context`: a word list of the sentence, including a token `<target>` that masks the targeted polysemous word.
- `part-of-speech`: a list of part-of-speech tags, one for each token in the sentence.
- `target_word`: the original polysemous word in the sentence masked by `<target>`.
- `target_position`: the position of the targeted polysemous word in the word list.
- `target_word_pos`: the part-of-speech of the targeted polysemous word.
- `sense`: the correct sense of the targeted polysemous word in the context, represented by a set of sememes from HowNet.
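As a quick sanity check of this schema, the following sketch (assuming the dataset has been loaded into `dataset` as above) puts the masked word back and prints the original sentence:

```python
sample = dataset[0]
# target_position should index the '<target>' token in context
# (an assumption based on the field descriptions above)
tokens = list(sample['context'])
tokens[sample['target_position']] = sample['target_word']
print(''.join(tokens))   # Chinese text is written without spaces between words
print(sample['sense'])   # the gold sense, as a set of HowNet sememes
```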
Alternatively, with `json`:

```python
import json

dataset = []
with open("data/dataset.json", 'r', encoding='utf-8') as f:
    for line in f:
        sample = json.loads(line.strip())
        dataset.append(sample)
```
You may need to manually convert some data types if you load with `json`, such as `target_word_pos`, which should be an integer.
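For instance, you could convert it in place right after loading (a minimal sketch, assuming `dataset` was loaded with `json` as above):

```python
for sample in dataset:
    sample['target_word_pos'] = int(sample['target_word_pos'])
```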
You can use the dataset for the following evaluation or in your own research.
We recommend using CUDA to accelerate the inference. Make sure you have generated the necessary files and put the dataset file in the `data/` directory.
```bash
CUDA_VISIBLE_DEVICES=0 python run_model.py
```
The command will test the model on the whole dataset and generate a log file for further evaluation.
After you get the log file, you can evaluate the model on various metrics with the following command:
```bash
python parse_log.py --model bert
```