SeCoDa

Repository for the Sense Complexity Dataset (SeCoDa)

Paper

For more information on SeCoDa, see the paper.

Publications using this dataset must include a reference to the following publication:

SeCoDa: Sense Complexity Dataset. David Strohmaier, Sian Gooding, Shiva Taslimipoor, Ekaterina Kochmar. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 5964–5969, Marseille, 11–16 May 2020

The dataset is based on the earlier CWIG3G2 dataset; see the paper and website. The relevant citation is:

Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann (2017): CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups. In Proceedings of The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017). Taipei, Taiwan

The complexity data can be found in the CWIG3G2 dataset and combined with the senses provided by SeCoDa.
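One way to combine the two resources is to join them on the sentence and the character span of the target. The sketch below assumes the CWIG3G2 records have already been parsed into dictionaries; the field names `sentence`, `start`, `end` and `label` are hypothetical and need to be adapted to the actual CWIG3G2 files. The SeCoDa field names follow the header shown in the example under Repository Content. The sample rows are invented for illustration and are not taken from either dataset.

```python
def attach_complexity(secoda_rows, cwig_rows):
    """Attach a complexity label to each sense annotation by span key."""
    # Index the CWIG3G2 records by (sentence, start, end).
    # These field names are hypothetical, not the official CWIG3G2 schema.
    labels = {
        (r["sentence"], r["start"], r["end"]): r["label"] for r in cwig_rows
    }
    merged = []
    for row in secoda_rows:
        key = (row["context"], row["offset_start"], row["offset_end"])
        # None marks spans without a matching CWIG3G2 record.
        merged.append({**row, "complexity": labels.get(key)})
    return merged


# Invented sample data, for illustration only.
secoda_rows = [{
    "target": "abroad",
    "offset_start": 28,
    "offset_end": 34,
    "context": "They spent two years living abroad.",
    "sense": "OTHER COUNTRY",
    "comments": "-",
}]
cwig_rows = [{
    "sentence": "They spent two years living abroad.",
    "start": 28,
    "end": 34,
    "label": 1,
}]

merged = attach_complexity(secoda_rows, cwig_rows)
print(merged[0]["target"], merged[0]["sense"], merged[0]["complexity"])
```

The join assumes that contexts and offsets match exactly between the two datasets. If they do not (for example because of whitespace differences), the key would need to be normalised first.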

Repository Content

Main data are found in SeCoDa.tsv. The columns are structured as follows.

target: token to be disambiguated
offset_start: start offset of the token in the context
offset_end: end offset of the token in the context
context: sentence in which the token occurs
sense: selected sense
comments: comments (also contains MWE information)

Example:

target	offset_start	offset_end	context	sense	comments
abroad	39	45	As we emerge...	OTHER COUNTRY...	-
abroad	39	45	As we emerge...	OTHER COUNTRY...	-
abroad	73	79	#1-8 The speech...	OTHER COUNTRY...	-

The senses are drawn from the Cambridge Advanced Learner's Dictionary.
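The column layout described above can be read with Python's standard `csv` module. This is a minimal sketch, assuming that SeCoDa.tsv starts with the header line shown in the example and that `offset_end` is exclusive (in the example, 45 - 39 equals the length of "abroad"). The sample row is invented for illustration and is not taken from the dataset.

```python
import csv
import io


def read_secoda(handle):
    """Parse SeCoDa-style TSV rows from an open text handle."""
    # QUOTE_NONE keeps quotation marks inside contexts as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Offsets are stored as text in the file; convert them to integers.
        row["offset_start"] = int(row["offset_start"])
        row["offset_end"] = int(row["offset_end"])
        rows.append(row)
    return rows


def span_text(row):
    """Return the substring of the context selected by the offsets."""
    return row["context"][row["offset_start"]:row["offset_end"]]


# Invented sample row (not from the dataset), in the assumed file layout.
SAMPLE = (
    "target\toffset_start\toffset_end\tcontext\tsense\tcomments\n"
    "abroad\t28\t34\tThey spent two years living abroad.\tOTHER COUNTRY\t-\n"
)

rows = read_secoda(io.StringIO(SAMPLE))
for row in rows:
    # Check that the offsets actually select the target token.
    print(row["target"], span_text(row) == row["target"])
```

To read the real file, pass `open("SeCoDa.tsv", encoding="utf-8", newline="")` instead of the `io.StringIO` object; the UTF-8 encoding is an assumption. The span check may fail for some rows, for instance multi-word expressions noted in the comments column, so it is better treated as a diagnostic than as a hard validation.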

UPDATE: Two missing entries have been added and typos in comments have been corrected.

UPDATE: Added further information to readme.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
