Turkish morphology datasets

TrMor2018

New Turkish morphology dataset based on mixed genre text used in the following paper (pdf, code).

@article{DBLP:journals/corr/abs-1805-07946,
  author    = {Erenay Dayanik and Ekin Aky{\"{u}}rek and Deniz Yuret},
  title     = {MorphNet: {A} sequence-to-sequence model that combines morphological analysis and disambiguation},
  journal   = {CoRR},
  volume    = {abs/1805.07946},
  year      = {2018},
  url       = {http://arxiv.org/abs/1805.07946},
  archivePrefix = {arXiv},
  eprint    = {1805.07946},
  timestamp = {Mon, 13 Aug 2018 16:47:09 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1805-07946},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

trmor2018.train was semiautomatically annotated, randomly split (80-10-10%) five times and the average scores were reported in the paper.

The lines in the file are either XML tags indicating sentence and document boundaries, or contain tab-separated analyses for a single word:

word<tab>tag1<tab>tag2...

The first analysis (tag1) is the correct one. When none of the analyses were deemed correct, tag1 is '?' and the other tags are printed in a random order.

trmor2018.train is verified to have 95.56% accuracy using handtagged/trmor2018.gold, a subset that was manually annotated by two annotators with differences adjudicated by a third. Please do not copy any data between trmor2018.train and trmor2018.gold in future versions and do not use trmor2018.gold for training or testing models, otherwise we lose our ability to measure accuracy using an independently tagged reference.

The analyses were produced by a newer version of Kemal Oflazer's finite state transducers circa 2018 for Turkish morphological analysis, and xfst the Xerox Finite State software. Please use both with permission. Here are the data statistics:

trmor2018	train
Documents	390
Sentences	34673
Tokens	460669
Unambiguous	243866
Ambiguous	215024
Unknown	1779

TrMor2016

Test set used in the following paper which used the same training set as TrMor2006.

@paper{AAAI1612370,
  author = {Eray Yildiz and Caglar Tirkaz and H. Sahin and Mustafa Eren and Omer Sonmez},
  title = {A Morphology-Aware Network for Morphological Disambiguation},
  conference = {AAAI Conference on Artificial Intelligence},
  year = {2016},
  url = {https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12370}
}

Approximately 20000 tokens from the training set were manually retagged to obtain a larger test set. Unfortunately the paper did not exclude these tokens from the training set and they did not provide any inter-annotator agreement. The data format is the same as trmor2006, here are the data statistics:

trmor2016	test
Documents	42
Sentences	1286
Tokens	19262
Unambiguous	9782
Ambiguous	9446
Unknown	34

TrMor2006

Turkish morphology dataset based on news text used in the following paper.

@InProceedings{yuret-ture:2006:HLT-NAACL06-Main,
  author    = {Yuret, Deniz  and  Ture, Ferhan},
  title     = {Learning Morphological Disambiguation Rules for Turkish},
  booktitle = {Proceedings of the Human Language Technology Conference of the NAACL, Main Conference},
  month     = {June},
  year      = {2006},
  address   = {New York City, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {328--334},
  url       = {http://www.aclweb.org/anthology/N/N06/N06-1042}
}

Each line lists a token or tag followed by one or more possible lemma+tag analyses separated by whitespace. The first analysis is the correct one. Unknown tags are marked with the substring UNKNOWN. The analyses were produced by tr-tagger.tgz, Kemal Oflazer's finite state transducers circa 2006 for Turkish morphological analysis , and xfst the Xerox Finite State software. Please use both with permission. The training set was semi-automatically tagged and is not very accurate. The test set was hand-tagged but is very small. Here are the data statistics:

trmor2006	train	test
Documents	2383	3
Sentences	50674	42
Tokens	837524	862
Unambiguous	436406	482
Ambiguous	399216	379
Unknown	1902	1

ai-ku/TrMor2018

Turkish morphology datasets

TrMor2018

TrMor2016

TrMor2006