The original sentences are not included.
I would like to test a tokenizer, but this corpus does not include the original sentences. Would it be possible to include the original sentences?
Some of the sentences seem to come from here: http://www.panl10n.net/english/outputs/Indonesia/BPPT/0902/BPPTIndToEngCorpusHalfM.zip
But I could not find the first sentence:
<kalimat id=1>
Kera NN
untuk SC
amankan VB
pesta olahraga NN
</kalimat>
and the IDs did not appear to match the sentence positions.
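For reference, here is a rough sketch of how I am currently approximating the sentences from the tagged file. It assumes each token line ends with a whitespace-separated tag and that everything before the tag is the token (so "pesta olahraga" stays one token). The function name `read_sentences` is my own, and joining tokens with spaces only approximates the original text, which is why the real sentences would still help.

```python
import re


def read_sentences(lines):
    """Yield (sentence_id, space-joined tokens) for each <kalimat> block.

    Assumption: the last whitespace-separated field on a token line is
    the POS tag; the rest of the line is the (possibly multi-word) token.
    """
    sent_id, tokens = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        opening = re.match(r"<kalimat id=(\d+)>", line)
        if opening:
            sent_id, tokens = int(opening.group(1)), []
        elif line == "</kalimat>":
            yield sent_id, " ".join(tokens)
            sent_id, tokens = None, []
        else:
            parts = line.rsplit(None, 1)
            if len(parts) == 2:  # skip lines without a tag field
                tokens.append(parts[0])


sample = """<kalimat id=1>
Kera NN
untuk SC
amankan VB
pesta olahraga NN
</kalimat>"""

sentences = list(read_sentences(sample.splitlines()))
```

Searching the BPPT files for the joined string is how I tried to locate each sentence; spacing around punctuation is lost, so exact matches can fail.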
Thank you for the inquiry. I think it is possible.
It was originally constructed from the first 10,000 sentences of the IDENTIC corpus by Larasati. As far as I know, IDENTIC includes sentences from the BPPT corpus.
I would like to make sure that everything is consistent before committing. In case you are in a hurry, I have attached a file that contains the original sentences. I would also very much appreciate it if you could report any inconsistencies :-)