CORD-19_KTool

Knowledge tool for COVID-19 papers on the CORD-19 dataset.

Situation

Scientists struggle to browse the huge number of papers about COVID-19, and few tools offer search interfaces good enough to retrieve the papers most relevant to a given researcher.

Task

Devise a proof-of-concept tool that combines, in a novel way, aspects of Information Retrieval (IR) and Information Extraction (IE) applied to the COVID-19 Open Research Dataset (CORD-19). The main goal of this work is to give researchers a better search tool for COVID-19-related papers, helping them find reference papers and highlighting relevant entities in the text.

Action

We applied Latent Dirichlet Allocation (LDA, using NLTK) to model the topics, based on research aspects, of all English abstracts in CORD-19, a large dataset of COVID-19 papers. The research aspects of each paper were extracted with the CODA-19 transformer model, trained on data from 10k CORD-19 abstracts. Relevant named entities in each abstract were extracted and linked to their corresponding UMLS concepts with SciSpacy. Regular expressions and the K-Nearest Neighbors algorithm were used to rank relevant papers.
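The ranking step can be pictured as a nearest-neighbor lookup over per-paper topic distributions. The following is a minimal sketch, not the tool's actual code: it assumes each paper is represented by its LDA topic-probability vector and that cosine similarity is the distance measure. The names `cosine_similarity`, `k_nearest_papers`, and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def k_nearest_papers(query_topics, paper_topics, k=3):
    """Return the ids of the k papers whose topic vectors are closest to the query."""
    scored = [
        (cosine_similarity(query_topics, vector), paper_id)
        for paper_id, vector in paper_topics.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [paper_id for _, paper_id in scored[:k]]


# Hypothetical topic distributions over 3 topics (assumed, not real CORD-19 data).
paper_topics = {
    "paper_a": [0.80, 0.10, 0.10],
    "paper_b": [0.10, 0.80, 0.10],
    "paper_c": [0.60, 0.30, 0.10],
    "paper_d": [0.05, 0.05, 0.90],
}
query = [0.90, 0.05, 0.05]  # a query dominated by the first topic

nearest = k_nearest_papers(query, paper_topics, k=2)
print(nearest)
```

With these sample vectors, the two papers weighted toward the first topic rank highest. A production version would likely use a library implementation of K-Nearest Neighbors instead of this brute-force scan.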

Results

Our tool has shown the potential to assist researchers by automating a topic-based search of CORD-19 papers. Nonetheless, we found that finer tuning of the topic modeling parameters and a more accurate research aspect classifier could make the tool more accurate and reliable.

Use case diagram

Link to the arXiv paper

Example use case 1

List of COVID-19 abstracts whose finding/contribution addresses the topic “patients, day, median, hospital, iqr, admission, range, died and years”, ordered by date of publication
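A query like this one could be sketched as a keyword match on one research aspect followed by a sort on publication date. This is an illustration under stated assumptions, not the tool's implementation: the record layout (`id`, `published`, `aspects`), the function names, and the sample abstracts are all hypothetical, and newest-first ordering is assumed because the sort direction is not stated.

```python
import re
from datetime import date


def matches_topic(text, keywords):
    """True if any topic keyword appears as a whole word in the text."""
    pattern = r"\b(?:" + "|".join(re.escape(word) for word in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def search(papers, aspect, keywords):
    """Keep papers whose given aspect mentions a keyword, newest first (assumed order)."""
    hits = [p for p in papers if matches_topic(p["aspects"].get(aspect, ""), keywords)]
    return sorted(hits, key=lambda p: p["published"], reverse=True)


# Hypothetical records; the real tool works on CORD-19 abstracts.
papers = [
    {"id": "p1", "published": date(2020, 3, 1),
     "aspects": {"finding": "Median hospital stay was 12 days (IQR 9-15)."}},
    {"id": "p2", "published": date(2020, 5, 10),
     "aspects": {"finding": "Of 191 patients, 54 died after admission."}},
    {"id": "p3", "published": date(2020, 4, 2),
     "aspects": {"finding": "The virus remained viable on plastic surfaces."}},
]
topic = ["patients", "day", "median", "hospital", "iqr", "admission", "range", "died", "years"]

results = search(papers, "finding", topic)
print([p["id"] for p in results])
```

Whole-word matching keeps the sketch simple, but it also means "day" does not match "days"; stemming or lemmatization would be one way to close that gap.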

Example use case 2

List of abstracts whose background addresses the topic of environmental stability of the virus

Example use case 3

Individual abstract visualization, with research aspects highlighted in different colors and UMLS terms visible to the user
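One way to picture this view is to wrap each sentence in an HTML span colored by its research aspect. The sketch below is an assumption about how such rendering might work, not the tool's code: the colors, the `highlight` function, and the aspect labels other than background and finding/contribution are hypothetical, and UMLS annotations are left out for brevity.

```python
import html

# Hypothetical color per research aspect; unknown aspects fall back to "other".
ASPECT_COLORS = {
    "background": "#cfe8ff",
    "purpose": "#ffe9b3",
    "method": "#d9f2d9",
    "finding": "#ffd6d6",
    "other": "#eeeeee",
}


def highlight(sentences):
    """Render (sentence, aspect) pairs as HTML spans colored by aspect."""
    spans = []
    for text, aspect in sentences:
        color = ASPECT_COLORS.get(aspect, ASPECT_COLORS["other"])
        spans.append(
            f'<span style="background:{color}" title="{html.escape(aspect)}">'
            f"{html.escape(text)}</span>"
        )
    return " ".join(spans)


# Hypothetical abstract, already split into sentences and labeled by aspect.
rendered = highlight([
    ("Coronaviruses can persist on surfaces.", "background"),
    ("We measured viral decay at 4 & 22 degrees Celsius.", "method"),
])
print(rendered)
```

Escaping the sentence text with `html.escape` keeps characters such as `&` or `<` in an abstract from breaking the generated markup.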

pivettamarcos/CORD-19_KTool