University of Ulsan - Korean-English Parallel Corpus with Korean Word-Sense Annotation

UKren

The University of Ulsan Korean-English Parallel Corpus (1.25M sentence pairs) with Korean Word-Sense Annotation was built at the NLP Lab, University of Ulsan, Republic of Korea (http://nlplab.ulsan.ac.kr).

UKren is a large-scale Korean-English parallel corpus. Its components and statistics are as follows.

. UKren: Korean-English Parallel Corpus.

. UKren_WS_Ann: Korean-English Parallel Corpus with Korean word-sense annotation produced by UTagger

. Total sentences: 1,251,075 pairs	
	
. Average sentence length (tokens per sentence)

	. English: 11.5
	
	. Korean: 8.5
	
	. Korean with Word-sense annotation: 28.8
	
. Total tokens

	. English: 14,387,731
	
	. Korean: 10,693,999
	
	. Korean with Word-sense annotation: 22,864,606
	

. Vocabulary size

	. English: 353,153
	
	. Korean: 827,315
	
	. Korean with Word-sense annotation: 135,657
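
The figures above can be related to one another with a short calculation. The sketch below assumes that the average sentence length is simply the total token count divided by the number of sentence pairs; the function name `average_length` is only illustrative and is not part of the corpus tooling.

```python
# Figures copied from the statistics listed above.
PAIRS = 1_251_075
TOTAL_TOKENS = {
    "English": 14_387_731,
    "Korean": 10_693_999,
    "Korean with word-sense annotation": 22_864_606,
}


def average_length(total_tokens: int, sentences: int) -> float:
    """Mean tokens per sentence, rounded to one decimal place.

    ASSUMPTION: average length = total tokens / number of sentence pairs.
    """
    return round(total_tokens / sentences, 1)


averages = {name: average_length(n, PAIRS) for name, n in TOTAL_TOKENS.items()}
for name, value in averages.items():
    print(f"{name}: {value}")
```

Under this assumption the English and Korean rows reproduce the listed averages (11.5 and 8.5). The annotated row comes out at about 18.3 rather than the listed 28.8, so that figure may have been computed on a different token count.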

The Korean word-sense annotation was produced with UTagger (http://nlplab.ulsan.ac.kr/doku.php?id=utagger), which performs the following steps:

. Korean morphological analysis

. POS tagging

. Sense-code tagging (a sense code identifies a specific sense of a word, as defined in the Standard Korean Language Dictionary)

UKren_Sample.txt and UKren_WS_Ann_Sample.txt are sample files with 5,000 sentence pairs. To use the full corpus, please contact us by e-mail: nqphuoc@gmail.com
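
As a starting point for exploring the sample files, the following sketch counts sentence pairs, tokens and vocabulary on each side. The file layout is not documented here, so the code rests on two loudly stated assumptions: each line holds one pair as Korean text, a tab, then English text, and tokens are separated by whitespace. The function name `corpus_stats` is hypothetical; adjust the splitting logic if the actual files differ.

```python
from typing import Dict, Iterable, Set


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count pairs, tokens and distinct tokens for both languages.

    ASSUMPTION: each line is "<Korean sentence>\t<English sentence>".
    ASSUMPTION: tokens are whitespace-separated.
    """
    pairs = 0
    ko_tokens = 0
    en_tokens = 0
    ko_vocab: Set[str] = set()
    en_vocab: Set[str] = set()

    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank lines and lines without a tab separator
        korean, english = line.split("\t", 1)
        ko_words = korean.split()
        en_words = english.split()
        ko_tokens += len(ko_words)
        en_tokens += len(en_words)
        ko_vocab.update(ko_words)
        en_vocab.update(en_words)
        pairs += 1

    return {
        "pairs": pairs,
        "ko_tokens": ko_tokens,
        "en_tokens": en_tokens,
        "ko_vocab": len(ko_vocab),
        "en_vocab": len(en_vocab),
    }


# Small in-memory demonstration with two made-up pairs and one blank line.
demo = [
    "나는 학교에 간다\tI go to school\n",
    "\n",
    "나는 집에 간다\tI go home\n",
]
stats = corpus_stats(demo)
print(stats)
```

To run it on a sample file, pass an open file handle instead of the demo list, for example `corpus_stats(open("UKren_Sample.txt", encoding="utf-8"))`, again assuming the tab-separated layout described above.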