The University of Ulsan Korean-English Parallel Corpus (1.25M sentence pairs), with Korean word-sense annotation, was built at the NLP Lab, University of Ulsan, Republic of Korea (http://nlplab.ulsan.ac.kr).
UKren is a large-scale Korean-English parallel corpus. Its components and statistics are as follows.
. UKren: Korean-English parallel corpus
. UKren_WS_Ann: Korean-English parallel corpus in which the Korean side carries word-sense annotation produced by UTagger
. Total sentences: 1,251,075 pairs
. Average sentence length
    . English: 11.5
    . Korean: 8.5
    . Korean with word-sense annotation: 28.8
. Total tokens
    . English: 14,387,731
    . Korean: 10,693,999
    . Korean with word-sense annotation: 22,864,606
. Vocabulary size
    . English: 353,153
    . Korean: 827,315
    . Korean with word-sense annotation: 135,657
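Statistics of this kind (sentence count, total tokens, average sentence length, vocabulary size) can be computed for one side of a parallel corpus roughly as in the minimal sketch below. It assumes one sentence per string and whitespace tokenization. The tokenization actually used to produce the figures above is not stated here, so the function name `corpus_stats` and the toy input are purely illustrative.

```python
from collections import Counter


def corpus_stats(sentences):
    """Return (sentence count, total tokens, average length, vocabulary size).

    Assumption: tokens are separated by whitespace. This may differ from
    the tokenization used for the published UKren figures.
    """
    vocab = Counter()
    total_tokens = 0
    n_sentences = 0
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)          # count every distinct token type
        total_tokens += len(tokens)
        n_sentences += 1
    # Guard against an empty corpus to avoid division by zero.
    average = total_tokens / n_sentences if n_sentences else 0.0
    return n_sentences, total_tokens, average, len(vocab)


# Toy input, not taken from the corpus.
sample = ["I go to school .", "I like school ."]
n_sentences, n_tokens, avg_len, vocab_size = corpus_stats(sample)
print(n_sentences, n_tokens, round(avg_len, 1), vocab_size)
```

Because the function accepts any iterable of strings, it can be fed an open file object line by line, which avoids loading the whole corpus into memory.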
The Korean word-sense annotation was produced with UTagger (http://nlplab.ulsan.ac.kr/doku.php?id=utagger), which performs the following steps:
. Korean morphological analysis
. POS tagging
. Sense-code tagging (a sense code identifies a specific sense of a word as defined in the Standard Korean Language Dictionary)
UKren_Sample.txt and UKren_WS_Ann_Sample.txt are sample files containing 5,000 sentence pairs. To use the full corpus, please contact us by e-mail: nqphuoc@gmail.com