ryancotterell/sigmorphon2016

datasets: consider overlap!

jeisner opened this issue

It is important to control how much the same lexemes are reused, both within and across the datasets.

If I am trying to assemble a partial paradigm, can I find useful information about that lexeme elsewhere in my current training set? Can I get additional useful information by looking at the test inputs? Even more by looking at the training sets for the other tasks? The current website explicitly allows this, but will it actually help?

Whether there's any benefit to clever methods (learning from partial paradigms directly or via nonparametrics) depends on the answers to these questions.
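As a first check, one could measure the lexeme overlap between splits directly. Here's a minimal sketch, assuming the task 1 file format (lemma, feature bundle, inflected form, tab-separated); the file paths are hypothetical placeholders:

```python
def lexemes(path):
    """Collect the set of lemmas (first tab-separated field) in a data file."""
    with open(path, encoding="utf-8") as f:
        return {line.split("\t")[0] for line in f if line.strip()}

# Hypothetical file names, just for illustration.
train = lexemes("spanish-task1-train")
test = lexemes("spanish-task1-test")

shared = train & test
print(f"{len(shared)} of {len(test)} test lexemes also appear in training "
      f"({len(shared) / len(test):.1%})")
```

For the task 2 and 3 files, which as I recall don't list the lemma, one would have to measure overlap of surface word forms instead.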