Subword-level Word Vector Representations for Korean (ACL 2018)

Word Vector Representation for Korean

Subword-level Word Vector Representations for Korean
Sungjoon Park, Jeongmin Byun, Sion Baek, Yongseok Cho, Alice Oh
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018)

Abstract

Research on distributed word representations is focused on widely-used languages such as English. Although the same methods can be used for other languages, language-specific knowledge can enhance the accuracy and richness of word vector representations. In this paper, we look at improving distributed word representations for Korean using knowledge about the unique linguistic structure of Korean. Specifically, we decompose Korean words into the jamo-level, beyond the character-level, allowing a systematic use of subword information. To evaluate the vectors, we develop Korean test sets for word similarity and analogy and make them publicly available. The results show that our simple method outperforms word2vec and character-level Skip-Grams on semantic and syntactic similarity and analogy tasks and contributes positively toward down-stream NLP tasks such as sentiment analysis.

Dataset

We release our evaluation datasets for Korean word vectors. Details are described below. We plan to develop more evaluation sets for the Korean NLP community, so comments on these sets and offers to collaborate on new ones are welcome!

1. WS-353 for word similarity (Korean)

  • 2 graduate students translated the original (English) set.
  • 14 native Korean speakers rated the translated set.
  • For each word pair, the minimum and maximum scores were excluded and the mean of the remaining scores was computed.
  • The resulting scores have a 0.82 correlation with the original English set.
  • Some words were replaced with ones more familiar to Korean speakers (e.g., Arafat -> 안중근).
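The score aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name trimmed_mean and the ratings are made up, and dropping exactly one minimum and one maximum (even when there are ties) is an assumption.

```python
from statistics import mean


def trimmed_mean(scores):
    """Mean of the scores after dropping one minimum and one maximum.

    Assumption: exactly one lowest and one highest score are removed,
    even if several annotators gave the same extreme rating.
    """
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    return mean(ordered[1:-1])


# Made-up ratings from 14 annotators for a single word pair.
ratings = [7, 8, 6, 9, 7, 8, 10, 2, 7, 8, 6, 7, 9, 8]
gold_score = trimmed_mean(ratings)
```

Trimming both ends keeps a single outlying annotator from shifting the score for a pair.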

2. Word Analogies (Korean)

  • 10,000 items: 5,000 semantic and 5,000 syntactic.
  • 5 categories each for semantic and syntactic features.
  • Each category contains 1,000 items.
  • Syntactic features (with an example):
    • Case: 자동차 자동차를 인터넷 인터넷을
    • Tense: 가다 갔다 공부하다 공부했다
    • Voice: 갈다 갈리다 거래하다 거래되다
    • Verb form: 가다 가고 놓다 놓고
    • Honorific: 가다 가시다 공부하다 공부하시다
  • Semantic features (with an example):
    • Capital-countries: 아테네 그리스 바그다드 이라크
    • Male-female: 남자 여자 아버지 어머니
    • Name-nationality: 간디 인도 나폴레옹 프랑스
    • Country-language: 아르헨티나 스페인어 미국 영어
    • Misc: 개 강아지 소 송아지
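Each analogy item lists four words a b c d, read as "a is to b as c is to d". One common way to score such items is vector arithmetic: predict d as the vocabulary word closest to b - a + c. The sketch below assumes that method; this README does not state the protocol used in the paper, which may differ. The two-dimensional vectors and the function names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word whose vector is closest to b - a + c, excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, invented for this example only.
toy_vectors = {
    "남자": [1.0, 0.0],
    "여자": [1.0, 1.0],
    "아버지": [2.0, 0.0],
    "어머니": [2.0, 1.0],
    "개": [-1.0, 0.5],
    "송아지": [0.5, -1.0],
}

a, b, c, expected = "남자 여자 아버지 어머니".split()
predicted = solve_analogy(toy_vectors, a, b, c)
```

An item counts as correct when predicted equals the fourth word of the line.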

Korean word vector representation learning

Building fastText for Korean

Before training Korean word vectors, build the source code for subword-level Korean word vectors (a.k.a. Korean fastText) with make:

$ cd src
$ make

This will produce object files for all the classes as well as the main binary fasttext.

1. Parse Korean documents.

First, parse your Korean document with decompose_letters.py. This script decomposes the Korean letters in the document into jamo and writes a parsed document, which is then used as training data for the vectors. An example is as follows:

python decompose_letters.py [input_file_name] [parsed_file_name]
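To give an idea of what decomposition means, here is a minimal sketch based on standard Unicode Hangul syllable arithmetic. It is not the code in decompose_letters.py: the function name decompose_word is hypothetical, the placeholder e for a missing final consonant follows the -emptyjschar default listed in this README, and passing non-Hangul characters through unchanged is an assumption.

```python
# Jamo tables in Unicode syllable-composition order.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"     # 21 vowels
JONGSEONG = [None] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals

HANGUL_BASE = 0xAC00   # first precomposed syllable, 가
HANGUL_LAST = 0xD7A3   # last precomposed syllable, 힣


def decompose_word(word, empty_jongsung="e"):
    """Split each Hangul syllable into initial, vowel, and final jamo.

    A syllable without a final consonant gets `empty_jongsung` in that slot,
    so every syllable yields exactly three symbols. Characters outside the
    precomposed Hangul range are copied as-is (an assumption of this sketch).
    """
    out = []
    for ch in word:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            offset = code - HANGUL_BASE
            cho = offset // (21 * 28)
            jung = (offset % (21 * 28)) // 28
            jong = offset % 28
            out.append(CHOSEONG[cho])
            out.append(JUNGSEONG[jung])
            out.append(JONGSEONG[jong] or empty_jongsung)
        else:
            out.append(ch)
    return "".join(out)


decomposed = decompose_word("강아지")
```

Because every syllable maps to exactly three symbols, jamo n-grams stay aligned to syllable boundaries.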

2. Train Korean word vectors.

Then you can train subword-level word vectors for Korean. The source code builds on the fastText implementation, so you can run the compiled binary just like the original fastText. Pass the file [parsed_file_name] generated by decompose_letters.py as the input. An example is as follows:

[fastText_executable_path] skipgram -input [parsed_file_name] -output [output_file_name] -minCount 10 -minjn 3 -maxjn 5 -minn 1 -maxn 4 -dim 300 -ws 5 -epoch 5 -neg 5 -loss ns -thread 16

The full list of parameters is given below.

-minCount : minimal number of word occurrences [5]
-bucket : number of buckets [10000000]
-minn : min length of char ngram [1]
-maxn : max length of char ngram [4]
-minjn : min length of jamo ngram [3]
-maxjn : max length of jamo ngram [5]
-emptyjschar : empty jongsung symbol ["e"]
-t : sampling threshold [1e-4]
-lr : learning rate [0.05]
-dim : size of word vectors [100]
-ws : size of the context window [5]
-loss : loss function {ns, hs, softmax} [softmax]
-neg : number of negatives sampled [5]
-epoch : number of epochs [5]
-thread : number of threads [12]
-verbose : verbosity level [2]

As described in the paper, character-level n-grams default to lengths 1-4 and jamo-level n-grams to lengths 3-5. If you widen these ranges, the number of distinct n-grams grows, so you should also raise the maximum number of unique n-grams (-bucket); otherwise some n-grams will collide and override each other. We recommend 10,000,000 buckets for approximately 3GB of (parsed) Korean corpus.
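The relationship between n-gram ranges and buckets can be illustrated with a short sketch. Several things here are assumptions rather than facts about this repository: the < and > word-boundary markers are borrowed from upstream fastText, the n-grams are taken over the whole decomposed word, and the hash is a generic 32-bit FNV-1a that is not guaranteed to produce the same bucket indices as the C++ code. All function names are hypothetical.

```python
def ngrams(symbols, n_min, n_max):
    """All contiguous n-grams of length n_min..n_max over a symbol sequence."""
    return ["".join(symbols[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(symbols) - n + 1)]


def fnv1a_32(data):
    """Generic 32-bit FNV-1a hash over a bytes object."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def bucket_index(ngram, bucket=10_000_000):
    """Map an n-gram to one of `bucket` slots; distinct n-grams may collide."""
    return fnv1a_32(ngram.encode("utf-8")) % bucket


# Decomposed form of 강아지 with assumed boundary markers.
jamo = list("<ㄱㅏㅇㅇㅏeㅈㅣe>")
jamo_grams = ngrams(jamo, 3, 5)          # the -minjn 3 -maxjn 5 range
slots = {g: bucket_index(g) for g in jamo_grams}
```

Every n-gram lands in one of a fixed number of slots, so the more distinct n-grams a corpus produces, the more of them share a slot unless the bucket count is raised.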

Constructing Korean OOV word vectors

The trained output file [output_file_name].bin can be used to compute word vectors for out-of-vocabulary (OOV) words. Provided you have a text file queries.txt containing the decomposed Korean words for which you want vectors, use the following command:

$ [fastText_executable_path] print-word-vectors [output_file_name].bin < queries.txt

Note that queries.txt should contain decomposed Korean words, such as ㄱㅏㅇㅇㅏeㅈㅣe for 강아지. You can use the jamo_split function in decompose_letters.py to obtain the decomposed form.
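If you capture the output of print-word-vectors, it can be loaded into Python with a few lines. The sketch below assumes each output line holds the query token followed by space-separated vector values, as upstream fastText prints them; check this against the actual output of your build. The function name and the sample line are made up.

```python
def parse_word_vectors(lines):
    """Parse lines of the assumed form '<token> <v1> <v2> ...' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Made-up sample line with a 3-dimensional vector, for illustration only.
sample_output = ["ㄱㅏㅇㅇㅏeㅈㅣe 0.12 -0.5 0.33"]
oov_vectors = parse_word_vectors(sample_output)
```

The dictionary keys are the decomposed forms, so keep a mapping back to the original words if you need them.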

Reference

Please cite the following paper if you use this code to learn word representations for Korean or use the evaluation sets to evaluate word vectors.

@inproceedings{park-etal-2018-subword,
    title = "Subword-level Word Vector Representations for {K}orean",
    author = "Park, Sungjoon and
      Byun, Jeongmin and
      Baek, Sion and
      Cho, Yongseok and
      Oh, Alice",
    booktitle = "Proceedings of the 56th Annual Meeting of the ACL",
    year = "2018",
    pages = "2429--2438"
}

Change Log

01-11-19 : Add implementations. version 1.0
05-04-18 : Initial upload of datasets. version 1.0