/synergy-dataset

SYNERGY - Open machine learning dataset on study selection in systematic reviews

Primary LanguagePythonCreative Commons Zero v1.0 UniversalCC0-1.0

SYNERGY dataset

DOI PyPI

SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points.

SYNERGY-banner.png

Get the data

The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. Install the package with:

pip install synergy-dataset

To download and build the SYNERGY dataset, run the following command in the command line:

python -m synergy_dataset get

To get an overview of the datasets and their properties, use synergy_dataset list and synergy_dataset show <DATASET_NAME>.

Datasets and variables

The SYNERGY dataset comprises the study selection of 26 systematic reviews. The dataset contains 169,288 records of which 2,834 records are manually labeled as inclusion by the authors of the systematic review. The list of systematic review and basic properties:

Nr Dataset Topic(s) Records Included %
1 Appenzeller-Herzog_2019 Medicine 2873 26 0.9
2 Bos_2018 Medicine 4878 10 0.2
3 Brouwer_2019 Psychology, Medicine 38114 62 0.2
4 Chou_2003 Medicine 1908 15 0.8
5 Chou_2004 Medicine 1630 9 0.6
6 Donners_2021 Medicine 258 15 5.8
7 Hall_2012 Computer science 8793 104 1.2
8 Jeyaraman_2020 Medicine 1175 96 8.2
9 Leenaars_2019 Psychology, Chemistry, Medicine 5812 17 0.3
10 Leenaars_2020 Medicine 7216 583 8.1
11 Meijboom_2021 Medicine 882 37 4.2
12 Menon_2022 Medicine 975 74 7.6
13 Moran_2021 Biology, Medicine 5214 111 2.1
14 Muthu_2021 Medicine 2719 336 12.4
15 Nelson_2002 Medicine 366 80 21.9
16 Oud_2018 Psychology, Medicine 952 20 2.1
17 Radjenovic_2013 Computer science 5935 48 0.8
18 Sep_2021 Psychology 271 40 14.8
19 Smid_2020 Computer science, Mathematics 2627 27 1
20 van_de_Schoot_2018 Psychology, Medicine 4544 38 0.8
21 van_der_Valk_2021 Medicine, Psychology 725 89 12.3
22 van_der_Waal_2022 Medicine 1970 33 1.7
23 van_Dis_2020 Psychology, Medicine 9128 72 0.8
24 Walker_2018 Biology, Medicine 48375 762 1.6
25 Wassenaar_2017 Medicine, Biology, Chemistry 7668 111 1.4
26 Wolters_2018 Medicine 4280 19 0.4

Each record in the dataset is an OpenAlex Work object (Copy at web.archive.org extracted on 2023-03-31).

Some of the notable variables are:

Variable Type Description
id String The OpenAlex ID for this work.
doi String The DOI identifier of the object if available
label_included Bin 1 for included records, 0 for excluded records after full text screening
title String The title of this work.
abstract String The abstract of this work. Stored as abstract_inverted_index, but available as plaintext abstract for machine learning purposes.
authorships List List of Authorship objects, each representing an author and their institution.
type String The type or genre of the work as defined by https://api.crossref.org/types.
publication_year Integer The year this work was published.
referenced_works List List of OpenAlex IDs for works that this work cites.
concepts List List of wikidata concept objects (or topics).
best_oa_location Object An object with the best available open access location for this work.
cited_by_count Integer The number of citations to this work at April 1st, 2023.

For the full list of variables, see this persistent copy of the OpenAlex Work Object documention: https://web.archive.org/web/20230104092916/https://docs.openalex.org/api-entities/works/work-object

Benchmark

Work in progress.

Attribution & License

We would like to thank the following authors for openly sharing the data correponding to their systematic review:

Marlies L.S. Heeres, Marijn Vellinga, P Whaley, Mostafa Mohseni, P.M.J. Welsing, Marleen L.M. Hermens, Richard Torkar, Holger Schielzeth, Marjan Hericko, Arnoud Arntz, Lisanne A. H. Bevers, Christian Appenzeller-Herzog, Michael J. DeVito, Juliette Legler, Rosalie W. M. Kempkes, Daniel Bos, Sanne C. Smid, Robyn B. Blain, Carin M. A. Rademaker, David De Jong, Antoine C. G. Egberts, Tijmen Geurts, Sathish Muthu, Suzanne C. van Veen, Janet D. Allan, Pamela Hartman, Eline S van der Valk, Mitzy Kennis, Wilhelmus Drinkenburg, R. Angela Sarabdjitsingh, Nicola P. Klein, Helga Gardarsdottir, Anouk A. M. T. Donners, Sonja D. Winter, Muriel A. Hagenaars, Erica L T van den Akker, Amir Abdelmoumen, Derek W. R. Gray, Kim Peterson, Eswar Ramakrishnan, Trevor J. Hall, Maurice Dematteis, Merel Ritskes-Hoitinga, Andrew A. Shapiro, Meike W. Vernooij, Maria Brouwer, Katherine E. Pelch, Milica Miočević, Eva A.M. van Dis, Ozair Abawi, Dimitrije Radjenović, Daniel McNeish, Peggy Nygren, Maikel van Berlo, Alwin D. R. Huitema, Nicholas P. Moran, Chad R. Blystone, Alishia D. Williams, Ruud N. J. M. A. Joosten, Klaus Reinhold, Pim N.H. Wassenaar, Sanne E. Hoeks, Anand Krishnan V. Iyer, Sjoerd A.A. van den Berg, Tim Kendall, Lieke H. van Huis, Rens van de Schoot, Nancy E. E. Van Loey, Julia M.L. Menon, Cathalijn H. C. Leenaars, Rogier E. J. Verhoef, Sarah Depaoli, Frank de Wolf, M.E. Hamaker, Rinske M van den Heuvel, Leonardo Trasande, Miranda Olff, Alfredo Sánchez-Tójar, M.H. Emmelot-Vonk, Kristina A. Thayer, Steven M. Teutsch, Elisabeth F.C. van Rossum, Bibian van der Voorn, Stephanie Holmgren, André Bleich, M.S. van der Waal, Frank J. Wolters, Hannah Ewald, Marian Joëls, Franck L. B. Meijboom, Yolanda B. de Rijke, Tobias Stalder, M. Arfan Ikram, P.A.L. Seghers, Marit Sijbrandij, Vincent L. Wester, Behnam Sabayan, Tim Mathes, Parvez Ahmad Ganie, Matthijs G. P. Feenstra, Abee L. Boyles, Matthijs Oud, Andrew A. Rooney, Rosanne W. Meijboom, Karl Heinz Weiss, Jan-Bas Prins, F. Struijs, David Bowes, Neeltje M. Batelaan, Reffat A. Segufa, Serena J. Counsell, Milou S. C. Sep, Aleš Živkovič, Madhan Jeyaraman, Sirwan K.L. Darweesh, Tineke Coenen-de Roo, Heidi Nelson, Roger Chou, Vickie R. Walker, Albert Hofman, Roger E. G. Schutgens, Rob B. M. de Vries, Zhongfang Fu, Pim Cuijpers, Christ Nolten, Krista Fischer, Janneke Elzinga, Roderick H. J. Houwen, Iris M. Engelhard, Linda Humphrey, Frans A. Stafleu, Simon Beecham, Mark Helfand, Thijs J. Giezen, Retha R. Newbold, Claudi L H Bockting, Sanaz Sedaghat, Elizabeth A. Clark

Run synergy_dataset attribution or see ATTRIBUTION.md for a complete attribution including references.

SYNERGY dataset is released under the CC0 1.0 license. SYNERGY consists of CC0 1.0 licensed metadata works published by OpenAlex. The Lens was used for data quality checks and imputing some missing variables.

Citing SYNERGY dataset

If you use SYNERGY in a scientific publication, we would appreciate references to:

De Bruin, Jonathan; Ma, Yongchao; Ferdinands, Gerbrich; Teijema, Jelle; Van de Schoot, Rens, 2023, "SYNERGY - Open machine learning dataset on study selection in systematic reviews", https://doi.org/10.34894/HE6NAQ, DataverseNL, V1

BibTeX reference:

@data{HE6NAQ_2023,
  author = {De Bruin, Jonathan and Ma, Yongchao and Ferdinands, Gerbrich and Teijema, Jelle and Van de Schoot, Rens},
  publisher = {DataverseNL},
  title = {{SYNERGY - Open machine learning dataset on study selection in systematic reviews}},
  year = {2023},
  version = {V1},
  doi = {10.34894/HE6NAQ},
  url = {https://doi.org/10.34894/HE6NAQ}
}

Contributing

We are welcoming contributions of all kinds. Some examples are:

  • Do you have an openly published systematic review dataset? Read about our ambition to develop SYNERGY+ (SYNERGY Plus), a much larger dataset with lots of new features.
  • Write an example or tutorial on how to use SYNERGY and all of its hidden capabilities.
  • Write integration to load SYNERGY into existing software like Spacy, Gensim, Tensorflow, Docker, Hugging Face.

Contact

Reach out on the Discussion forum.