Repository for the Sense Complexity Dataset (SeCoDa)
For more information on SeCoDa, see the paper.
Publications using this dataset must include a reference to the following publication:
SeCoDa: Sense Complexity Dataset. David Strohmaier, Sian Gooding, Shiva Taslimipoor, Ekaterina Kochmar. Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 5964–5969, Marseille, 11–16 May 2020
The dataset is based on the earlier CWIG3G2 dataset; see the corresponding paper and website. The relevant citation is:
Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann (2017): CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups. In Proceedings of The 8th International Joint Conference on Natural Language Processing (IJCNLP 2017). Taipei, Taiwan
The complexity data can be found in the CWIG3G2 dataset and combined with the senses provided by SeCoDa.
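One way to combine the two resources is sketched below. This is only an illustration under stated assumptions: it supposes that both datasets have already been loaded as lists of dictionaries, that a CWIG3G2 record can be matched to a SeCoDa row by its sentence and character offsets, and that the complexity value sits in a field called `complexity`. The function name and all field names on the CWIG3G2 side are hypothetical, so check them against the actual CWIG3G2 files before relying on this.

```python
def attach_complexity(secoda_rows, complexity_rows):
    """Join sense rows with complexity records on (context, start, end).

    Both arguments are lists of dicts. The shared key fields and the
    "complexity" field are assumptions made for this sketch, not the
    documented CWIG3G2 layout.
    """
    # Index the complexity records by sentence and offsets.
    index = {}
    for rec in complexity_rows:
        key = (rec["context"], rec["offset_start"], rec["offset_end"])
        # Keep the first record seen for each key.
        index.setdefault(key, rec)

    merged = []
    for row in secoda_rows:
        key = (row["context"], row["offset_start"], row["offset_end"])
        rec = index.get(key)
        combined = dict(row)
        # None marks rows without a matching complexity record.
        combined["complexity"] = rec["complexity"] if rec else None
        merged.append(combined)
    return merged


# Invented sample data, not taken from either dataset.
secoda_sample = [{
    "target": "abroad", "offset_start": 11, "offset_end": 17,
    "context": "They moved abroad last year.",
    "sense": "OTHER COUNTRY", "comments": "-",
}]
complexity_sample = [{
    "context": "They moved abroad last year.",
    "offset_start": 11, "offset_end": 17, "complexity": 0.25,
}]
merged_sample = attach_complexity(secoda_sample, complexity_sample)
```

Rows without a match keep a `complexity` of `None`, which makes it easy to spot alignment problems between the two files.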
The main data are in SeCoDa.tsv, which has the following columns:
- target: token to be disambiguated
- offset_start: start offset of the token in the context
- offset_end: end offset of the token in the context
- context: sentence in which the token occurs
- sense: selected sense
- comments: comments (also contains MWE information)
Example:
target | offset_start | offset_end | context | sense | comments |
---|---|---|---|---|---|
abroad | 39 | 45 | As we emerge... | OTHER COUNTRY... | - |
abroad | 39 | 45 | As we emerge... | OTHER COUNTRY... | - |
abroad | 73 | 79 | #1-8 The speech... | OTHER COUNTRY... | - |
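The column layout above can be read with Python's standard `csv` module, as in the following minimal sketch. It assumes that the file is tab-separated with the six fields in the order shown, that the first line is a header (set `has_header=False` if it is not), and that offsets are zero-based with an exclusive end, which is consistent with the example rows (45 - 39 equals the length of "abroad"). The function names and the sample sentence are invented for illustration.

```python
import csv
import io

COLUMNS = ["target", "offset_start", "offset_end",
           "context", "sense", "comments"]


def read_secoda(handle, has_header=True):
    """Parse tab-separated SeCoDa-style rows from an open text handle."""
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for i, fields in enumerate(reader):
        if i == 0 and has_header:
            continue  # skip the header line (an assumption, see above)
        if not fields:
            continue  # skip blank lines
        row = dict(zip(COLUMNS, fields))
        row["offset_start"] = int(row["offset_start"])
        row["offset_end"] = int(row["offset_end"])
        rows.append(row)
    return rows


def target_matches_offsets(row):
    """Check that the offsets select the target token in the context."""
    span = row["context"][row["offset_start"]:row["offset_end"]]
    return span == row["target"]


# Invented sample in the same layout, not taken from SeCoDa.tsv.
sample = (
    "target\toffset_start\toffset_end\tcontext\tsense\tcomments\n"
    "abroad\t11\t17\tThey moved abroad last year.\tOTHER COUNTRY\t-\n"
)
sample_rows = read_secoda(io.StringIO(sample))
```

For the real file, something like `with open("SeCoDa.tsv", encoding="utf-8", newline="") as f: rows = read_secoda(f)` should work, assuming UTF-8 encoding. Running `target_matches_offsets` over all rows is a quick way to test the offset assumption.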
The senses are drawn from the Cambridge Advanced Learner's Dictionary.
UPDATE: Two missing entries have been added and typos in comments have been corrected.
UPDATE: Added further information to readme.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.