ufal/conll2017

Definition of "very small corpora"

Closed this issue · 4 comments

(with some exceptions -- Japanese because of the license, very small corpora and corpora which cannot be reliably detokenized).

How would you define "very small corpora"? Would this mean that Kazakh and Buryat are excluded?

foxik commented

Our worry about very small corpora is that their test sets are so small that even a few words make up a considerable percentage (6 words in the Kazakh test set are more than 1%) -- we wanted to avoid people spending most of their time working on the smallest corpora to get a better overall score.

Another worry was whether we could get enough raw text.

But the first issue could be alleviated by changing the overall score computation. We will discuss it in a new issue.

There is no problem with getting raw text for Kazakh or Buryat. As for the other points, we can take them up in the new issue. One other possibility would be to have a "small corpus" track, in which subsets of approximately the same size are taken from all the corpora.

The visibility of being included in shared tasks like this is a massive motivation for people planning to work on open UD-based treebanks, and it would be a shame to turn around to them and say "well, you didn't do enough work", especially without explicitly saying how big or small a corpus needs to be.

In any case regarding the average, it's two languages out of 40-50, so even if people really tune the hell out of Kazakh and Buryat, I don't expect it would have a massive effect on the final score.
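As a back-of-the-envelope check (with made-up numbers, not actual shared-task scores), a plain macro-average over ~45 treebanks dilutes even a large per-treebank gain:

```python
# Hypothetical illustration: how much heavy tuning of two small
# treebanks can move a macro-averaged score over many treebanks.

def macro_average(scores):
    """Plain (unweighted) average over per-treebank scores."""
    return sum(scores.values()) / len(scores)

# Assume 45 treebanks, each scoring 70.0 (made-up numbers).
scores = {f"treebank_{i}": 70.0 for i in range(45)}
baseline = macro_average(scores)

# Suppose intensive tuning lifts two small treebanks by 10 points each.
scores["treebank_0"] += 10.0  # stand-in for Kazakh
scores["treebank_1"] += 10.0  # stand-in for Buryat
tuned = macro_average(scores)

print(round(tuned - baseline, 3))  # overall gain: 20/45 ≈ 0.444
```

So even a 10-point jump on both corpora shifts the overall average by well under half a point.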

foxik commented

You have a very good point regarding "not being part of a shared task".

We have therefore removed the exclusion of very small corpora. We currently suggest leaving out only Japanese [because of the licence of the original corpus] and the corpora we cannot reliably detokenize [Old Church Slavonic and Gothic are candidates, but if someone is able to get additional raw data, they should be fine].

Note that the final decision, as per the Berlin meeting, is to re-introduce a lower limit on corpus size: the test data must have at least 10,000 words and the development data should ideally also reach 10,000 words, although this is not a hard constraint. There is no size requirement for training data—it can be zero.