Knowledge tool for COVID-19 papers on the CORD-19 dataset.
Scientists struggle to browse the very large number of papers about COVID-19, and few tools offer search interfaces good enough to retrieve the papers most relevant to a given researcher.
To devise a proof-of-concept tool that combines, in a novel way, aspects of Information Retrieval (IR) and Information Extraction (IE) applied to the COVID-19 Open Research Dataset (CORD-19). The main focus of this work is to provide researchers with a better search tool for COVID-19 related papers, helping them find reference papers and highlighting relevant entities in the text.
We applied Latent Dirichlet Allocation (LDA - NLTK) to model the topics of all English abstracts in CORD-19, grouped by research aspect. The research aspects of each paper were extracted with the CODA-19 transformer model, trained on data from 10k CORD-19 abstracts. Relevant named entities in each abstract were extracted and linked to their corresponding UMLS concepts with SciSpacy. Regular expressions and the K-Nearest Neighbors algorithm were used to rank relevant papers.
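The ranking step can be illustrated with a minimal sketch. The text above does not say how regular expressions and K-Nearest Neighbors are combined, so everything below is an assumption made for illustration. A regular expression tokenizes the text, abstracts are represented as plain term counts (the actual tool may instead use topic distributions), and the k abstracts closest to a query by cosine similarity are returned. The names `tokenize`, `cosine`, and `rank_abstracts`, and the sample abstracts, are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: tokens are lowercase alphanumeric runs, optionally hyphenated
# (so "COVID-19" stays a single token).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase tokens with a regular expression."""
    return TOKEN_RE.findall(text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_abstracts(query, abstracts, k=3):
    """Return the k abstracts nearest to the query as (paper_id, score) pairs."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(text))), paper_id)
        for paper_id, text in abstracts.items()
    ]
    # Highest similarity first; ties are broken by paper id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(paper_id, score) for score, paper_id in scored[:k]]


# Hypothetical toy data, not taken from CORD-19.
abstracts = {
    "p1": "Median hospital admission age of patients was 60 years.",
    "p2": "Environmental stability of the virus on surfaces.",
    "p3": "Patients admitted to hospital; median stay was 12 days.",
}
top = rank_abstracts("hospital admission of patients", abstracts, k=2)
for paper_id, score in top:
    print(paper_id, round(score, 3))
```

A brute-force scan like this is only practical for small collections. For a corpus the size of CORD-19, an indexed nearest-neighbor search would likely be needed.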
Our tool showed the potential to assist researchers by automating a topic-based search of CORD-19 papers. Nonetheless, we found that finer tuning of the topic modeling parameters and a more accurate research aspect classifier could make the tool more accurate and reliable.
List of COVID-19 abstracts whose finding/contribution addresses the topic “patients, day, median, hospital, iqr, admission, range, died and years”, ordered by date of publication.
List of abstracts whose background addresses the topic of environmental stability of the virus.
Visualization of an individual abstract, with research aspects highlighted in different colors and UMLS terms visible to the user.