Small Japanese-English Subtitle Corpus

jesc_small

jesc_small is a small Japanese-English subtitle corpus. Its sentences are extracted from JESC (Japanese-English Subtitle Corpus), keeping only sentences of 4 to 16 words.

Both Japanese and English sentences are tokenized with StanfordNLP (v0.2.0).

All text is encoded in UTF-8. The sentence separator is '\n' and the word separator is ' '.
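The file format described above can be read with a few lines of Python. The following is a minimal sketch, not part of this repo: the function names `read_tokenized` and `read_parallel` are hypothetical, and it assumes that line i of an English file is the translation of line i of the matching Japanese file (for example train.en and train.ja).

```python
def read_tokenized(path):
    """Read one tokenized file: one sentence per line, tokens separated by ' '."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            # An empty line yields an empty token list rather than [''].
            sentences.append(line.split(" ") if line else [])
    return sentences


def read_parallel(en_path, ja_path):
    """Pair up sentences from two files, assuming they are aligned line by line."""
    en = read_tokenized(en_path)
    ja = read_tokenized(ja_path)
    if len(en) != len(ja):
        raise ValueError(
            f"line counts differ: {len(en)} in {en_path}, {len(ja)} in {ja_path}"
        )
    return list(zip(en, ja))
```

A call such as `read_parallel("train.en", "train.ja")` would then return a list of (English tokens, Japanese tokens) pairs, provided both files are in the working directory.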

Additionally, all tokenized data can be downloaded from here.

Corpus statistics

File      #sentences  #words   #vocabulary
train.en  100,000     809,353  29,682
train.ja  100,000     808,157  46,471
dev.en    1,000       8,025    1,827
dev.ja    1,000       8,163    2,340
test.en   1,000       8,057    1,805
test.ja   1,000       8,084    2,306
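Counts of this kind can be recomputed from the files themselves. The sketch below is one possible way to do it and makes assumptions the README does not confirm: words are counted as space-separated tokens, and the vocabulary is the set of distinct tokens with no lowercasing or other normalization. The function name `corpus_stats` is hypothetical, and its results may differ from the table if the original counts were produced differently.

```python
def corpus_stats(path):
    """Return (#sentences, #words, #vocabulary) for one tokenized file."""
    n_sentences = 0
    n_words = 0
    vocabulary = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Tokens are separated by a single space; drop empty strings
            # so blank lines and stray double spaces add no words.
            tokens = [t for t in line.rstrip("\n").split(" ") if t]
            n_sentences += 1
            n_words += len(tokens)
            vocabulary.update(tokens)
    return n_sentences, n_words, len(vocabulary)
```

For example, `corpus_stats("dev.en")` returns a 3-tuple that can be compared against the dev.en row.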

This repo is inspired by small_parallel_enja.