Source code for our EMNLP 2017 paper "Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components".
You need to prepare a training corpus and the Chinese subcharacter radicals or components.
- Training corpus. Download the Chinese Wikipedia dump. Following the instructions on the blog, you can extract the raw content from the XML file and preprocess it, for example by removing pure digits and non-Chinese characters. Alternatively, you can download the preprocessed corpus from the online Baidu box.
- Subcharacter radicals and components. Deploy the Scrapy code in `JWE/ChineseCharCrawler` on Scrapy Cloud to crawl the resource from HTTPCN. We provide a copy of the data in `./subcharacters` for research convenience. The copyright and all rights therein of the subcharacter data are reserved by the website HTTPCN.
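The corpus preprocessing mentioned above (removing pure digits and non-Chinese characters) could be sketched roughly as follows. This is only an illustration and not the script used to build the provided corpus: the function name `clean_line` is hypothetical, and the sketch assumes that the text is already segmented into whitespace-separated tokens and that "Chinese" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import re

# Assumption: a token counts as Chinese only if every character lies in the
# basic CJK Unified Ideographs block (U+4E00..U+9FFF).
CHINESE_TOKEN = re.compile(r"[\u4e00-\u9fff]+")


def clean_line(line):
    """Keep only whitespace-separated tokens made up purely of Chinese characters."""
    return " ".join(tok for tok in line.split() if CHINESE_TOKEN.fullmatch(tok))


# Pure digits ("2017"), Latin tokens ("abc") and mixed tokens ("v2") are dropped.
print(clean_line("中文 维基 2017 abc 百科 v2"))  # 中文 维基 百科
```

Dropping whole tokens is one possible design choice; stripping only the offending characters inside a token would be an equally plausible reading of the preprocessing step.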
- `cd JWE/src`, then compile the code with `make all`.
- Run `./jwe` for parameter details.
- Run `./run.sh` to start model training. You may modify the parameters in the file `run.sh`.
- Input file formats: the corpus `wiki.txt` contains segmented Chinese words in UTF-8 encoding; the subcharacter file `comp.txt` contains a list of components separated by blank spaces; in `char2comp.txt`, each line consists of a Chinese character followed by its components, in the following format:
侩 亻 人 云
侨 亻 乔
侧 亻 贝 刂
侦 亻 卜 贝
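
For illustration, lines in this format can be read into a character-to-components mapping with a few lines of Python. This is a minimal sketch, not part of the released code; the helper name `parse_char2comp` is hypothetical, and it assumes fields are separated by whitespace with the character first.

```python
def parse_char2comp(lines):
    """Map each character to its list of components, one entry per non-empty line."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and characters with no components listed
        mapping[parts[0]] = parts[1:]
    return mapping


# The four sample lines shown above.
sample = ["侩 亻 人 云", "侨 亻 乔", "侧 亻 贝 刂", "侦 亻 卜 贝"]
char2comp = parse_char2comp(sample)
print(char2comp["侨"])  # ['亻', '乔']
```

To read the real file, the same function can be given an open file object, e.g. `parse_char2comp(open("char2comp.txt", encoding="utf-8"))`, assuming the file follows the format above.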
Two Chinese word similarity datasets, `240.txt` and `297.txt`, and one Chinese analogy dataset, `analogy.txt`, in the `JWE/evaluation` folder are provided by (Chen et al., IJCAI, 2015).
`cd JWE/src`, then
- Run `python word_sim.py -s <similarity_file> -e <embed_file>` for word similarity evaluation, where `similarity_file` is the word similarity file (e.g., `240.txt` or `297.txt`) and `embed_file` is the trained word embedding file.
- Run `python word_analogy.py -a <analogy_file> -e <embed_file>` or `./word_analogy <embed_file> <analogy_file>` for word analogy evaluation.
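
Word similarity evaluations of this kind are commonly scored as the Spearman rank correlation between the cosine similarity of two word vectors and the human-assigned score. The sketch below illustrates that idea on toy data; it is not the code of `word_sim.py`, and the function names, the in-memory `(word1, word2, score)` tuples, and the dictionary of embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def evaluate(pairs, embeddings):
    """Return (Spearman correlation, number of pairs used); skip out-of-vocabulary pairs."""
    gold, pred = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            gold.append(score)
            pred.append(cosine(embeddings[w1], embeddings[w2]))
    return spearman(gold, pred), len(gold)


# Toy, made-up data: the pair containing "x" is skipped because "x" has no vector.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
toy_pairs = [("a", "b", 9.0), ("a", "c", 5.0), ("a", "d", 1.0), ("a", "x", 3.0)]
rho, used = evaluate(toy_pairs, toy_embeddings)
print(rho, used)
```

On this toy data the model ranking agrees with the human ranking for all three usable pairs, so the correlation comes out at (approximately) 1.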