
Can you spot automatically generated scientific excerpts?


Detecting generated scientific papers

Competition website here...

Description

This competition is a part of the shared task hosted within the third workshop on Scholarly Document Processing (SDP 2022), being held in association with the 29th International Conference on Computational Linguistics (COLING 2022).

There are increasing reports that research papers can be written by computers, which raises a number of concerns (e.g., see [1]). In this challenge, we explore the state of the art in detecting automatically generated papers. We frame detection as a binary classification task: given an excerpt of text, label it as either human-written or machine-generated. We provide a corpus of over 5000 excerpts from automatically written papers, based on the work by Cabanac et al. [2], as well as on documents collected by Elsevier publishing and editorial teams. As a control set, we provide a 5x larger corpus of openly accessible human-written and generated papers from the same scientific domains. We also encourage contributions that extend this dataset with other computer-generated scientific papers, or that propose valid metrics for distinguishing automatically generated papers from those written by humans.
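To make the task concrete, here is a minimal rule-based sketch of the binary classification setup described above: flag an excerpt as machine-generated if it contains a known "tortured phrase" (the telltale paraphrases studied in [1] and [2]). The phrase list below is a tiny illustrative sample, not the competition's actual feature set, and a real submission would use a learned classifier rather than a lookup table.

```python
# Toy baseline for the excerpt-labeling task: a lookup of a few documented
# "tortured phrases" (machine paraphrases of standard scientific terms).
# This list is illustrative only, not the competition data or features.
TORTURED_PHRASES = {
    "counterfeit consciousness",  # paraphrase of "artificial intelligence"
    "profound learning",          # paraphrase of "deep learning"
    "colossal information",       # paraphrase of "big data"
}

def label_excerpt(text: str) -> str:
    """Return 'machine-generated' if a tortured phrase appears, else 'human-written'."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in TORTURED_PHRASES):
        return "machine-generated"
    return "human-written"

print(label_excerpt("Profound learning models improve the results."))  # machine-generated
print(label_excerpt("Deep learning models improve the results."))      # human-written
```

A phrase lookup like this has high precision but poor recall; the provided 5x control corpus is what makes it possible to train and evaluate stronger statistical classifiers.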

Acknowledgements

We thank Cyril Labbé, Basile Dubois-Binnaire, Guillaume Cabanac, and Alexander Magazinov for their input in the ideation phase of the task preparation.

Links

[1] Holly Else. (2021). "'Tortured phrases' give away fabricated research papers." Nature.

[2] Guillaume Cabanac, Cyril Labbé, and Alexander Magazinov. (2021). "Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals."