KoreanNet

KoreanNet is a neural architecture for modeling the compositional orthography of the Korean language. It decomposes each character into its underlying jamo letters (basic phonetic units), from which character representations are composed. For example, the embedding for the character 갔 is a function of ㄱ, ㅏ, and ㅆ. The decomposition can be performed deterministically and efficiently by simple Unicode manipulation: we use a robust implementation provided here.
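To illustrate the kind of Unicode manipulation involved, here is a minimal sketch of syllable decomposition. It is not the implementation shipped in this repository: the name `sketch_decompose` and the lookup strings are hypothetical and rely only on the standard layout of the Unicode Hangul Syllables block (U+AC00 to U+D7A3), where each code point encodes a (lead, vowel, tail) triple.

```python
# Hypothetical helper for illustration; not the repository's jamo.py.
HANGUL_BASE = 0xAC00  # first precomposed syllable, 가
HANGUL_LAST = 0xD7A3  # last precomposed syllable, 힣

# Compatibility jamo in the order used by the Unicode syllable layout.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
# Index 0 means "no final consonant".
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def sketch_decompose(ch):
    """Return (lead, vowel, tail) for one precomposed Hangul syllable.

    The tail is '' when the syllable has no final consonant.
    Returns None for any character outside the Hangul Syllables block.
    """
    code = ord(ch)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return None
    index = code - HANGUL_BASE
    # index = (lead * 21 + vowel) * 28 + tail
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return LEADS[lead], VOWELS[vowel], TAILS[tail]
```

Under these assumptions, `sketch_decompose("갔")` yields `('ㄱ', 'ㅏ', 'ㅆ')`, matching the example above, and a non-Hangul character such as `"a"` yields `None`.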

This package plugs KoreanNet into the wonderful BiLSTM parser of Kiperwasser and Goldberg (2016). The jamo-based model achieves the high performance of character-based models while eschewing the need to store combinatorially many character types as lookup parameters. There is further improvement when jamos, characters, and words are used in conjunction. For details please refer to the paper.
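The saving in lookup parameters can be made concrete with some back-of-the-envelope arithmetic. The sketch below counts the precomposed syllables that Unicode defines against the number of position-specific jamo symbols. The embedding dimension of 100 mirrors the `--cembedding 100` flag used in the training commands. The helper name `lookup_table_size` is hypothetical, and the figures are upper bounds on table sizes, not the parameter counts of the actual model.

```python
# Counts from the Unicode Hangul syllable layout.
N_LEADS, N_VOWELS, N_TAILS = 19, 21, 27

# Every syllable is a lead, a vowel, and an optional tail.
n_syllables = N_LEADS * N_VOWELS * (N_TAILS + 1)

# Position-specific jamo symbols needed to cover all syllables.
n_jamo_symbols = N_LEADS + N_VOWELS + N_TAILS

EMBED_DIM = 100  # assumption: same value as --cembedding 100


def lookup_table_size(n_types, dim=EMBED_DIM):
    """Number of parameters in an embedding table with n_types rows."""
    return n_types * dim


char_params = lookup_table_size(n_syllables)
jamo_params = lookup_table_size(n_jamo_symbols)
```

With these numbers, a full character table would hold 11,172 rows (1,117,200 parameters at dimension 100), while a jamo table holds 67 rows (6,700 parameters). A real corpus contains only a fraction of the possible syllables, so the gap seen in practice is smaller than this bound.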

Prerequisites

Training Commands

Data shuffling is disabled in the code so that experiments are reproducible. To enable shuffling, uncomment random.shuffle(shuffledData) in arc_hybrid.py. The Korean word embeddings were induced by running CCA on a Wikipedia dump and are available here.

Word only

python src/parser.py --dynet-seed 123456789 --dynet-mem 2000 --outdir ../scratch/word --train ../data/ko-universal-train.conll.shuffled --dev ../data/ko-universal-dev.conll --extrn ../data/emb100.ko --wembedding 100

Character only

python src/parser.py --dynet-seed 123456789 --dynet-mem 2000 --outdir ../scratch/char --train ../data/ko-universal-train.conll.shuffled --dev ../data/ko-universal-dev.conll  --cembedding 100  --usechar --noword

Jamo only

python src/parser.py --dynet-seed 123456789 --dynet-mem 2000 --outdir ../scratch/jamo --train ../data/ko-universal-train.conll.shuffled --dev ../data/ko-universal-dev.conll  --cembedding 100  --usejamo --noword

Word, character, and jamo

python src/parser.py --dynet-seed 123456789 --dynet-mem 3000 --outdir ../scratch/word-char-jamo --train ../data/ko-universal-train.conll.shuffled --dev ../data/ko-universal-dev.conll --extrn  ../data/emb100.ko --wembedding 100 --cembedding 100  --usechar --usejamo
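The four training commands above differ only in the memory setting, the output directory, and the embedding flags. If you want to launch them from a script, something like the following sketch may help. The names `CONFIGS` and `build_command` are hypothetical and not part of this package; all flags and paths are copied from the commands above.

```python
# Hypothetical launcher helper; flags and paths copied from the README commands.
TRAIN = "../data/ko-universal-train.conll.shuffled"
DEV = "../data/ko-universal-dev.conll"
WORD_FLAGS = ["--extrn", "../data/emb100.ko", "--wembedding", "100"]

# configuration name -> (DyNet memory in MB, configuration-specific flags)
CONFIGS = {
    "word": ("2000", WORD_FLAGS),
    "char": ("2000", ["--cembedding", "100", "--usechar", "--noword"]),
    "jamo": ("2000", ["--cembedding", "100", "--usejamo", "--noword"]),
    "word-char-jamo": (
        "3000",
        WORD_FLAGS + ["--cembedding", "100", "--usechar", "--usejamo"],
    ),
}


def build_command(name):
    """Return the argument list for one training configuration."""
    mem, extra = CONFIGS[name]
    return [
        "python", "src/parser.py",
        "--dynet-seed", "123456789",
        "--dynet-mem", mem,
        "--outdir", "../scratch/" + name,
        "--train", TRAIN,
        "--dev", DEV,
    ] + extra


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run to execute it.
    for name in CONFIGS:
        print(" ".join(build_command(name)))
```

Building the argument list separately from running it makes it easy to inspect each command before committing to a long training run.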

Parsing Commands

python src/parser.py --predict --outdir ../scratch/jamo --test  ../data/ko-universal-test.conll --model ../scratch/jamo/model

Jamo Decomposition

If you only need the jamo decomposition for your own task, use the decompose function in jamo.py.

Reference

A Sub-Character Architecture for Korean Language Processing.

karlstratos/koreannet