Keyword-Extraction-Datasets
This repository contains seven annotated datasets for the automatic keyword extraction task. Each dataset consists of documents (.txt or .abstr) and their corresponding gold-standard keyword lists (.key or .uncontr). We used these datasets in our studies of supervised and unsupervised keyword extraction. Our published works based on them are:
- sCAKE: Semantic Connectivity Aware Keyword Extraction
- Complex Network based Supervised Keyword Extractor
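The document/keyword pairing described above can be loaded with a few lines of Python. The following is a minimal sketch, not the loader used in our papers. It assumes that a document and its gold-standard file share the same file stem, and that keyphrases are separated either by semicolons or by line breaks. The names `load_pairs` and `read_keyphrases` are hypothetical helpers introduced for illustration; check the actual folder layout and file format of each dataset before relying on it.

```python
from pathlib import Path

# Extensions mentioned in this README.
DOC_SUFFIXES = (".txt", ".abstr")
KEY_SUFFIXES = (".key", ".uncontr")


def read_keyphrases(path):
    """Read a gold-standard file into a list of keyphrases.

    Assumption: if the file contains semicolons, they separate keyphrases
    (a phrase may then wrap across lines); otherwise each line is one phrase.
    """
    raw = Path(path).read_text(encoding="utf-8", errors="replace")
    parts = raw.split(";") if ";" in raw else raw.splitlines()
    # Collapse internal whitespace and drop empty entries.
    return [" ".join(p.split()) for p in parts if p.strip()]


def load_pairs(doc_dir, key_dir=None):
    """Return {stem: (document_text, keyphrases)} for documents with a key file.

    Assumption: the key file has the same stem as the document and lives in
    key_dir (defaults to doc_dir).
    """
    doc_dir = Path(doc_dir)
    key_dir = Path(key_dir) if key_dir is not None else doc_dir
    pairs = {}
    for doc_path in sorted(doc_dir.iterdir()):
        if doc_path.suffix not in DOC_SUFFIXES:
            continue
        for suffix in KEY_SUFFIXES:
            key_path = key_dir / (doc_path.stem + suffix)
            if key_path.exists():
                text = doc_path.read_text(encoding="utf-8", errors="replace")
                pairs[doc_path.stem] = (text, read_keyphrases(key_path))
                break
    return pairs
```

Documents without a matching gold-standard file are skipped silently, which may or may not be what you want for a given experiment.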
The datasets and their download sources are listed below; the original papers that proposed them are cited at the end of this document.
- Hulth2003: Abstracts from the Inspec dataset. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
- WWW and KDD: CS abstracts from the WWW and KDD conferences. We kept only those documents that contain at least two sentences and at least one gold-standard keyword. Originally downloaded from https://www.dropbox.com/s/3c57qar1b0xseob/kpshare.tgz?dl=0 (link no longer available). The full dataset can be downloaded from https://github.com/LIAAD/KeywordExtractor-Datasets/tree/master/datasets.
- Marujo2012: News articles. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
- Krapivin2009: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
- SemEval2010: ACM full papers. Originally downloaded from https://github.com/snkim/AutomaticKeyphraseExtraction.
- NLM500: PubMed documents. Originally downloaded from https://github.com/zelandiya/keyword-extraction-datasets. Created for the abstractive keyword extraction (KE) task.
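The filtering rule stated for WWW and KDD (at least two sentences and at least one gold-standard keyword) can be approximated as follows. This is a rough sketch only: the sentence splitter is a naive regular expression chosen for illustration, not necessarily the tokenizer we used, and `count_sentences` and `keep_document` are hypothetical names.

```python
import re


def count_sentences(text):
    """Count sentences with a naive rule: split after '.', '!' or '?'
    followed by whitespace. Abbreviations will be over-split."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for piece in pieces if piece)


def keep_document(text, keyphrases):
    """Keep a document only if it has at least two sentences
    and at least one gold-standard keyword."""
    return count_sentences(text) >= 2 and len(keyphrases) >= 1
```

A proper sentence tokenizer would give slightly different counts, so the retained set may not match the document counts reported in this repository exactly.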
Dataset details and collection statistics
Dataset | \|D\| | Lavg | Navg | Kavg | KPavg | Description |
---|---|---|---|---|---|---|
Hulth2003 | 1500 | 129 | 23 | 10 | 90.07 | Abstracts from Inspec dataset |
WWW | 1248 | 174 | 9 | 5 | 64.97 | Abstracts from CS articles published in WWW conference |
KDD | 704 | 204 | 8 | 4 | 68.12 | Abstracts from CS articles published in KDD conference |
Marujo2012 | 450 | 427 | 69 | 48 | 99.31 | Online news articles |
Krapivin2009 | 2304 | 7961 | 11 | 5 | 96.91 | Full scientific articles from ACM |
SemEval2010 | 244 | 8085 | 34 | 16 | 95.89 | Full scientific articles from ACM, created for SemEval2010 Task 5 |
NLM500 | 500 | 4854 | 27 | 14 | 71.35 | Full papers from PubMed database |
|D|: Number of documents. Lavg: Average document length, in words. Navg: Average number of gold-standard keywords (unigrams) assigned per document. Kavg: Average number of gold-standard keyphrases (n-grams) assigned per document. KPavg: Average percentage of gold-standard keyphrases present in the text.
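The statistics defined above can be recomputed roughly as shown below. This is a sketch under several assumptions that are not stated in this README: words are split on whitespace, Navg counts the distinct words across a document's keyphrases, a keyphrase counts as present when its lowercased form is a substring of the lowercased text (no stemming), and KPavg is averaged per document. The function name `collection_stats` is hypothetical, and the numbers it produces may differ from the table.

```python
def collection_stats(docs):
    """Compute |D|, Lavg, Navg, Kavg and KPavg.

    docs: list of (text, keyphrases) pairs, where keyphrases is a list of str.
    """
    n_docs = len(docs)
    if n_docs == 0:
        raise ValueError("docs must not be empty")

    total_words = total_unigrams = total_phrases = 0
    total_present_pct = 0.0
    for text, phrases in docs:
        # Normalise case and whitespace for both text and keyphrases.
        norm_text = " ".join(text.lower().split())
        norm_phrases = [" ".join(p.lower().split()) for p in phrases]

        total_words += len(norm_text.split())
        # Assumption: unigrams are the distinct words over all keyphrases.
        total_unigrams += len({w for p in norm_phrases for w in p.split()})
        total_phrases += len(norm_phrases)
        if norm_phrases:
            # Assumption: plain substring match, so "graph" matches "graphs".
            present = sum(1 for p in norm_phrases if p in norm_text)
            total_present_pct += 100.0 * present / len(norm_phrases)

    return {
        "D": n_docs,
        "Lavg": total_words / n_docs,
        "Navg": total_unigrams / n_docs,
        "Kavg": total_phrases / n_docs,
        "KPavg": total_present_pct / n_docs,
    }
```

For example, with the two toy documents `("Keyword extraction with graph models", ["keyword extraction", "neural network"])` and `("News about media", ["news"])`, the first document contains one of its two keyphrases and the second contains its only keyphrase, so the per-document percentages are 50 and 100.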
Citations:
The original papers that proposed the datasets are cited below.
Hulth2003
@inproceedings{hulth2003improved,
title = "Improved Automatic Keyword Extraction given more Linguistic Knowledge",
author = "Hulth, Anette",
booktitle = "Proceedings of the 2003 Conference on EMNLP",
pages = "216--223",
year = "2003",
organization = "ACL"
}
Krapivin2009
@article{krapivin2009large,
title = "Large Dataset for Keyphrases Extraction",
author = "Krapivin, Mikalai and Autaeu, Aliaksandr and Marchese, Maurizio",
journal = "Technical Report DISI-09-055",
year = "2009",
publisher = "University of Trento"
}
NLM500
@inproceedings{aronson2000nlm,
title = "The NLM Indexing Initiative",
author = "Aronson and others",
booktitle = "Proceedings of the AMIA Symposium",
pages = "17",
year = "2000",
organization = "American Medical Informatics Association"
}
SemEval2010
@inproceedings{kim2010semeval,
title = "SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles",
author = "Kim, Su Nam and Medelyan, Olena and Kan, Min-Yen and Baldwin, Timothy",
booktitle = "Proceedings of the 5th International Workshop on Semantic Evaluation",
pages = "21--26",
year = "2010",
organization = "Association for Computational Linguistics"
}
Marujo2012
@inproceedings{marujo2012supervised,
title = "Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization",
author = "Marujo, Lu{\'\i}s and Gershman, Anatole and Carbonell, Jaime and Frederking, Robert and Neto, Jo{\~a}o P",
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012)",
year = "2012"
}
WWW and KDD
@inproceedings{gollapalli2014extracting,
title = "Extracting keyphrases from research papers using citation networks",
author = "Gollapalli, Sujatha Das and Caragea, Cornelia",
booktitle = "Twenty-Eighth AAAI Conference on Artificial Intelligence",
year = "2014"
}