A literature corpus on the origins of SARS-CoV-2 virus
To build a full picture of previous studies on the origins of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), we use an active learning-based approach to screen scholarly articles about the origins of SARS-CoV-2 from a large pool of scientific publications. Specifically, six seed articles were used to manually curate 170 relevant and 300 nonrelevant articles. An active learning-based approach with three query strategies and three base classifiers was then trained to screen articles about the origins of SARS-CoV-2. Extensive experimental results show that our active learning-based approach outperforms traditional counterparts, and that the uncertainty sampling query strategy performs best among the three strategies. By manually checking the top 1,000 articles of each base classifier, we ultimately screened 715 unique scholarly articles to create a publicly available peer-reviewed literature corpus, COVID-Origin. This indicates that our approach for screening articles about the origins of SARS-CoV-2 is feasible.
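As a rough illustration of the uncertainty sampling query strategy mentioned above, the sketch below picks the unlabeled articles whose predicted relevance is closest to 0.5. It assumes a binary classifier and the least-confidence variant of uncertainty sampling; the function name `least_confident_query` and the example probabilities are hypothetical and are not taken from the article's implementation.

```python
def least_confident_query(probabilities, batch_size=1):
    """Return indices of the unlabeled articles to send for manual labeling.

    probabilities: predicted P(relevant) for each unlabeled article
                   (hypothetical classifier output, values in [0, 1]).
    batch_size:    how many articles to query in this round.
    """
    # For a binary classifier, confidence is the probability of the
    # predicted class; uncertainty is highest when P(relevant) is near 0.5.
    uncertainty = [1.0 - max(p, 1.0 - p) for p in probabilities]
    # Rank article indices from most to least uncertain.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: uncertainty[i],
                    reverse=True)
    return ranked[:batch_size]


# Made-up scores for four unlabeled articles.
scores = [0.95, 0.52, 0.10, 0.40]
queried = least_confident_query(scores, batch_size=2)
print(queried)
```

In an active learning loop, the queried articles would be labeled by hand, added to the training set, and the classifier retrained before the next round.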
We highly appreciate any suggestions and comments.
COVID-Origin is a free literature corpus on the origins of the SARS-CoV-2 virus. You can redistribute it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
You should have received a copy of the GNU General Public License along with COVID-Origin; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
COVID-Origin.xlsx: The literature corpus on the origins of the SARS-CoV-2 virus. The documents in the sheet "Documents in Seed Dataset" come from our annotated seed dataset, and the documents in the sheet "Screened Documents" come from our approach.
seed_dataset.xlsx: The annotated seed documents.
Data of Tables and Figures.xlsx: The data for all tables and figures in our article.
If you find this corpus useful, please cite it as follows:
Xin An, Mengmeng Zhang, and Shuo Xu, 2022. An Active Learning-based Approach for Screening Scholarly Articles about the Origins of SARS-CoV-2. PLoS ONE, Vol. 17, No. 9, pp. e0273725.