KoreanNet

KoreanNet is a neural architecture for modeling the unique compositional orthography of the Korean language. It decomposes each character into its underlying jamo letters (basic phonetic units), from which character representations are composed. For example, the embedding for a syllable character is a function of the jamos it contains: an initial consonant, a vowel, and an optional final consonant. The decomposition can be performed deterministically and efficiently by simple Unicode manipulation: we use a robust implementation provided here.

This package plugs KoreanNet into the wonderful BiLSTM parser of Kiperwasser and Goldberg (2016). The jamo-based model achieves the high performance of character-based models while eschewing the need to store combinatorially many character types as lookup parameters. There is further improvement when jamos, characters, and words are used in conjunction. For details please refer to the paper.

Prerequisites

Training Commands

Data shuffling is disabled in the code so that experiments are reproducible. To enable shuffling, uncomment random.shuffle(shuffledData) in arc_hybrid.py. The Korean word embeddings were induced by running CCA on a Wikipedia dump and are available here.

Word only
python src/parser.py --dynet-seed 123456789 --dynet-mem 2000 --outdir ../scratch/word --train ../data/ko-universal-train.conll.shuffled --dev ../data/ko-universal-dev.conll --extrn ../data/emb100.ko --wembedding 100
Character only
python src/parser.py --dynet-seed 123456789 --dynet-mem 2000 --outdir ../scratch/char --train ../data/ko-universal-train.conll.shuffled --dev ../data/ko-universal-dev.conll --cembedding 100 --usechar --noword
Jamo only
python src/parser.py --dynet-seed 123456789 --dynet-mem 2000 --outdir ../scratch/jamo --train ../data/ko-universal-train.conll.shuffled --dev ../data/ko-universal-dev.conll --cembedding 100 --usejamo --noword
Word, character, and jamo
python src/parser.py --dynet-seed 123456789 --dynet-mem 3000 --outdir ../scratch/word-char-jamo --train ../data/ko-universal-train.conll.shuffled --dev ../data/ko-universal-dev.conll --extrn ../data/emb100.ko --wembedding 100 --cembedding 100 --usechar --usejamo

Parsing Commands

python src/parser.py --predict --outdir ../scratch/jamo --test ../data/ko-universal-test.conll --model ../scratch/jamo/model

Jamo Decomposition

If you only need the jamo decomposition for your own task, use the decompose function in jamo.py.
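To illustrate the kind of Unicode arithmetic such a decomposition relies on, here is a minimal sketch based on the standard Hangul syllable layout in Unicode. It is not the code in jamo.py: the name decompose_syllable and its return format are assumptions made for this example, and the repository's decompose function may differ in signature, in the jamo code points it emits, and in how it handles syllables without a final consonant.

```python
# Illustrative sketch only; decompose_syllable is a hypothetical helper,
# not the decompose function shipped in jamo.py.

S_BASE = 0xAC00  # first precomposed Hangul syllable
L_BASE = 0x1100  # first leading (initial) consonant jamo
V_BASE = 0x1161  # first vowel jamo
T_BASE = 0x11A7  # one position before the first trailing (final) consonant jamo
V_COUNT = 21     # number of vowels
T_COUNT = 28     # number of trailing consonants, plus one for "no trailing consonant"
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def decompose_syllable(ch):
    """Return the conjoining jamos of one precomposed Hangul syllable as a tuple.

    Characters outside the Hangul syllable block are returned unchanged
    as a one-element tuple.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return (ch,)
    lead = s_index // (V_COUNT * T_COUNT)
    vowel = (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    tail = s_index % T_COUNT
    jamos = [chr(L_BASE + lead), chr(V_BASE + vowel)]
    if tail:  # tail index 0 means the syllable has no final consonant
        jamos.append(chr(T_BASE + tail))
    return tuple(jamos)
```

Calling decompose_syllable on a syllable with a final consonant yields three jamos, on an open syllable two, and on any non-Hangul character the character itself. For precomposed syllables the result matches the canonical decomposition produced by Python's unicodedata.normalize("NFD", ...).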

Reference

A Sub-Character Architecture for Korean Language Processing.