famrashel/idn-tagged-corpus

The original sentences are not included.

I would like to test a tokenizer, but this corpus does not include the original sentences. Would it be possible to include the original sentences?

Some of the sentences seem to come from here: http://www.panl10n.net/english/outputs/Indonesia/BPPT/0902/BPPTIndToEngCorpusHalfM.zip

But I could not find the first sentence:

<kalimat id=1>
Kera	NN
untuk	SC
amankan	VB
pesta olahraga	NN
</kalimat>

and the ids did not appear to match the sentence positions.
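
For anyone who wants to search the source corpus for a given sentence, here is a minimal sketch that reads the tagged format shown above and rebuilds an approximate sentence by joining tokens with spaces. It assumes each sentence is wrapped in `<kalimat id=N>` ... `</kalimat>` and that token and tag are separated by a tab; the function names `parse_tagged_corpus` and `approximate_sentence` are made up for illustration and are not part of the repository. The rebuilt string will not match the original exactly wherever punctuation was attached to a word.

```python
import re

# Assumption: sentences are delimited by <kalimat id=N> ... </kalimat>,
# with one "token<TAB>tag" pair per line, as in the excerpt above.
OPEN_RE = re.compile(r"<kalimat id=(\d+)>")


def parse_tagged_corpus(text):
    """Return a list of (sentence_id, [(token, tag), ...]) tuples."""
    sentences = []
    current_id = None
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = OPEN_RE.fullmatch(line)
        if match:
            # Start of a new sentence block.
            current_id = int(match.group(1))
            pairs = []
        elif line == "</kalimat>":
            # End of the current block; store what was collected.
            if current_id is not None:
                sentences.append((current_id, pairs))
            current_id = None
        elif current_id is not None:
            # Split on the last tab so multi-word tokens keep their spaces.
            token, sep, tag = line.rpartition("\t")
            if sep:
                pairs.append((token, tag))
    return sentences


def approximate_sentence(pairs):
    """Join tokens with single spaces (only an approximation of the raw text)."""
    return " ".join(token for token, _ in pairs)
```

Calling `approximate_sentence` on each parsed block gives a string that can be searched for in the BPPT files, which may help narrow down where a sentence came from.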

Thank you for the inquiry. I think it is possible.
The corpus was originally constructed from the first 10,000 sentences of the IDENTIC corpus by Larasati. As far as I know, IDENTIC includes sentences from the BPPT corpus.

I would like to make sure that everything is consistent before committing. In case you are in a hurry, I have attached a file that contains the original sentences. I would also very much appreciate it if you could report any inconsistencies :-)

raw_sentence.txt
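
One possible way to look for inconsistencies is sketched below. It assumes, without confirmation, that `raw_sentence.txt` holds one sentence per line in the same order as the tagged corpus, and that the tokens of each tagged sentence have already been loaded into a list. The helper names `tokens_match_sentence` and `find_inconsistencies` are hypothetical. The check ignores whitespace entirely, so it only flags sentences whose characters differ, not those that differ in spacing.

```python
def tokens_match_sentence(tokens, raw_sentence):
    """True if the tokens and the raw sentence are equal once whitespace is removed."""
    joined_tokens = "".join("".join(token.split()) for token in tokens)
    joined_raw = "".join(raw_sentence.split())
    return joined_tokens == joined_raw


def find_inconsistencies(tagged_sentences, raw_sentences):
    """Return 1-based positions where the tagged tokens and raw sentence disagree.

    tagged_sentences: list of token lists, one per sentence.
    raw_sentences: list of strings, assumed to be in the same order.
    Positions present in only one of the two lists are reported as well.
    """
    mismatches = []
    shared = min(len(tagged_sentences), len(raw_sentences))
    for position in range(shared):
        if not tokens_match_sentence(tagged_sentences[position], raw_sentences[position]):
            mismatches.append(position + 1)
    # Anything beyond the shorter list has no counterpart to compare against.
    longest = max(len(tagged_sentences), len(raw_sentences))
    mismatches.extend(range(shared + 1, longest + 1))
    return mismatches
```

An empty result would suggest that the two files line up under these assumptions; any reported positions are candidates to inspect by hand.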