Wikinflection Corpus

An inflectional corpus with inflectional morpheme annotations in 68 languages, comprising 216K lemmas and 5.4M words. It is based on the English Wiktionary (en.wiktionary.org), generated by Wikinflection (Metheniti and Neumann, 2018), and evaluated with UniMorph 2.0 (Kirov et al., 2018).

The list of languages and the size of each can be found in corpus_size.csv.
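As a convenience, the size file can be inspected with a few lines of Python. This is only a minimal sketch: the column layout of corpus_size.csv is not described here, so the code assumes nothing beyond a comma-separated file with a header row, and the function name `load_corpus_sizes` is hypothetical.

```python
import csv
from pathlib import Path


def load_corpus_sizes(path="corpus_size.csv"):
    """Read the size table and return (header, rows).

    Assumption: the file is comma-separated with a header row.
    Rows are returned as dicts keyed by whatever column names
    the header actually contains; no column names are assumed.
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to an empty list
        header = list(reader.fieldnames or [])
    return header, rows
```

Calling `load_corpus_sizes()` from the repository root and printing the returned header is a quick way to see which columns the file actually provides before relying on any of them.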

Paper

Metheniti, E. and Neumann, G. (2020). Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC2020), Marseille, France, May. European Language Resources Association (ELRA). [link] [BibTeX]

References

Kirov, C., Cotterell, R., Sylak-Glassman, J., Walther, G., Vylomova, E., Xia, P., Faruqui, M., Mielke, S., McCarthy, A., Kübler, S., Yarowsky, D., Eisner, J., and Hulden, M. (2018). UniMorph 2.0: Universal Morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Metheniti, E. and Neumann, G. (2018). Wikinflection: Massive semi-supervised generation of multilingual inflectional corpus from Wiktionary. In Proceedings of the 17th International Workshop on Treebanks and Linguistic Theories (TLT 2018), December 13–14, 2018, Oslo University, Norway, number 155, pages 147–161. Linköping University Electronic Press.
