yusugomori/jesc_small

Small Japanese-English Subtitle Corpus

jesc_small

Small Japanese-English Subtitle Corpus. Sentences are extracted from JESC (Japanese-English Subtitle Corpus) and filtered to lengths of 4 to 16 words.
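The length filter can be sketched as follows. This is only an illustration, not the script used to build the corpus: the function name `keep_pair` is hypothetical, and it assumes that the bounds are inclusive and that both the English and the Japanese side must fall within them.

```python
from typing import List


def keep_pair(en_tokens: List[str], ja_tokens: List[str],
              min_len: int = 4, max_len: int = 16) -> bool:
    """Return True if both sides of a sentence pair are within the length bounds.

    Assumptions (not confirmed by this README): bounds are inclusive and
    apply to both languages.
    """
    return (min_len <= len(en_tokens) <= max_len
            and min_len <= len(ja_tokens) <= max_len)
```

If the original filter applied to only one language, or used exclusive bounds, the condition would need to be adjusted accordingly.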

Both Japanese and English sentences are tokenized with StanfordNLP (v0.2.0).

All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' '.
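Given this format, the files can be read with a few lines of Python. The sketch below is a minimal example; the helper names `parse_tokenized` and `read_parallel` are hypothetical, and it assumes that the `.en` and `.ja` files of a split are aligned line by line.

```python
from pathlib import Path
from typing import List, Tuple


def parse_tokenized(text: str) -> List[List[str]]:
    """Split text into sentences on '\n' and each sentence into words on ' '."""
    return [line.split(" ") for line in text.split("\n") if line]


def read_parallel(en_path: str, ja_path: str) -> List[Tuple[List[str], List[str]]]:
    """Read two UTF-8 files and pair their sentences by line number.

    Assumption: line i of the English file translates line i of the Japanese file.
    """
    en = parse_tokenized(Path(en_path).read_text(encoding="utf-8"))
    ja = parse_tokenized(Path(ja_path).read_text(encoding="utf-8"))
    if len(en) != len(ja):
        raise ValueError(f"line count mismatch: {len(en)} vs {len(ja)}")
    return list(zip(en, ja))


# Example usage (paths are assumed to point at files from this repo):
# pairs = read_parallel("train.en", "train.ja")
# en_tokens, ja_tokens = pairs[0]
```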

Additionally, all tokenized data can be downloaded from here.

Corpus statistics

File	#sentences	#words	#vocabulary
train.en	100,000	809,353	29,682
train.ja	100,000	808,157	46,471
dev.en	1,000	8,025	1,827
dev.ja	1,000	8,163	2,340
test.en	1,000	8,057	1,805
test.ja	1,000	8,084	2,306
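Counts of this kind can be computed with a short script. The sketch below assumes that #words is the total number of space-separated tokens and #vocabulary is the number of distinct tokens; the function name `corpus_stats` is hypothetical, and the result is not guaranteed to match the table exactly if the original counts were produced differently.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (number of sentences, number of tokens, number of distinct tokens)."""
    n_sentences = 0
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines such as a trailing newline
        n_sentences += 1
        counts.update(line.split(" "))
    return n_sentences, sum(counts.values()), len(counts)


# Example usage (path is assumed to point at a file from this repo):
# with open("train.en", encoding="utf-8") as f:
#     print(corpus_stats(f))
```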

This repo is inspired by small_parallel_enja.