COVID-19: Understanding the range of incubation periods and how long individuals are contagious after recovery.
We utilized the Semantic Scholar COVID-19 Open Research Dataset (CORD-19) as well as the COVID-19 Literature Knowledge Graph from Steenwinckel et al. derived from the CORD-19 dataset to extract information regarding the COVID-19 incubation and contagious periods.
Our implementation consists of the following steps:
- Identify keywords associated with the incubation period
- Extract the relevant papers from the CORD-19 dataset using a regex query
- Use the CDQA library and spaCy's NER model to find the 'answer' to the query 'what is the incubation period'
- Filter out extreme values and unrecognizable characters
- Repeat the third step
- Generate the PageRank values
- Use the PageRank values to weight the number of days suggested by each paper
Results - The incubation period is 3.5 to 13.5 days (mean = 8.41 days, SD = 5); the weighted incubation period is 8.38 days. The contagious period is 3 to 7 days (mean = 5.43 days, SD = 2.12); the weighted contagious period is 5.38 days.
This directory contains all of our implementation (code) files along with the additional assets detailed below.
The assets directory houses all of the relevant .csv and .tsv files and the cleaned COVID-19 knowledge graph.
final_ib.csv contains information on the CORD-19 literature that pertains to the incubation/contagious periods.
final_ib_pagerankings_title.tsv contains the page rankings produced by running the PageRank algorithm with the NetworkX library, along with each paper's title and DOI.
This compressed file contains our modified knowledge graph in N-Triples format. Note that the knowledge graph was modified because of parsing errors in the original literature knowledge graph.
Out of all the papers and their texts, we have to find the sentences that mention the incubation period. However, the search cannot be straightforward: searching for just "incubation period" produces false positives, such as incubation periods of other diseases or mentions of the incubation period with no number of days, among others. So we used regex patterns with several conditions to find the sentences relevant to our search. We applied the same method to find the contagious period.
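The filtering idea above can be sketched as follows. This is a minimal illustration, not the patterns we actually ran: the three expressions (`KEYWORD`, `DISEASE`, `DAYS`) and the helper `relevant_sentences` are hypothetical names chosen for this example, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical patterns for illustration; the project's real expressions may differ.
KEYWORD = re.compile(r"\bincubation\s+(?:period|time)s?\b", re.IGNORECASE)
DISEASE = re.compile(r"\b(?:covid-19|sars-cov-2|2019-ncov)\b", re.IGNORECASE)
# A number, optionally a range such as "2-14" or "2 to 14", followed by "day(s)".
DAYS = re.compile(
    r"(\d+(?:\.\d+)?)(?:\s*(?:-|to)\s*(\d+(?:\.\d+)?))?\s*days?\b",
    re.IGNORECASE,
)


def relevant_sentences(text):
    """Return (sentence, [days, ...]) pairs that name the disease, the keyword and a day count."""
    results = []
    # Naive split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Drop sentences about other diseases or without the keyword.
        if not (KEYWORD.search(sentence) and DISEASE.search(sentence)):
            continue
        days = []
        for low, high in DAYS.findall(sentence):
            days.append(float(low))
            if high:
                days.append(float(high))
        # Drop sentences that mention the incubation period without any number of days.
        if days:
            results.append((sentence, days))
    return results


sample = (
    "The incubation period of COVID-19 is 2 to 14 days. "
    "The incubation period of influenza is 1-4 days. "
    "The incubation period of SARS-CoV-2 remains unclear. "
    "COVID-19 cases rose over 10 days."
)
hits = relevant_sentences(sample)
```

On the sample text only the first sentence survives: the second is about another disease, the third has no day count, and the fourth lacks the keyword. Swapping `KEYWORD` for a contagious-period pattern would apply the same filter to the second question.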
Because we wanted to rely on the most credible papers, we used the PageRank algorithm to rank the papers based on the number of times each has been cited. These rankings were then used as weights in determining the incubation and contagious periods.
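To make the ranking step concrete, here is a small pure-Python power-iteration PageRank over a toy citation graph. It is a stand-in for illustration only: the project ran NetworkX's `pagerank` on the citation subgraph, while the function below, the damping factor of 0.85, the citing-to-cited edge direction, and the paper IDs are assumptions of this sketch.

```python
def pagerank(citations, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank over a dict {paper: [papers it cites]}."""
    nodes = set(citations)
    for cited_list in citations.values():
        nodes.update(cited_list)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by papers that cite nothing is spread evenly over all papers.
        dangling = sum(rank[node] for node in nodes if not citations.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each paper passes an equal share of its rank to every paper it cites.
        for citing, cited_list in citations.items():
            if not cited_list:
                continue
            share = damping * rank[citing] / len(cited_list)
            for cited in cited_list:
                new_rank[cited] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy graph with made-up paper IDs: A and B both cite C, and C cites D.
toy = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
ranks = pagerank(toy)
```

In the toy graph, C outranks A and B because two papers cite it, and D ranks highest because it is cited by the well-ranked C. The values sum to 1, which makes them convenient to use as weights.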
page_rank.ipynb can be opened in either Google Colab or Jupyter Notebook (Note: we used Google Colab to run this file, so the paths will need to be updated since we were accessing files in our own Google Drive). There are instructions as well as additional implementation explanations within the .ipynb. The first cell can be run to install the dependency libraries: rdflib, networkx, and tqdm. The libraries and their uses are listed below:
- rdflib: used to load the literature knowledge graph and generate a citation subgraph.
- networkx: used to run the PageRank algorithm.
- tqdm: this library is purely optional, but it is helpful for displaying progress bars.
Once we have the CSV file of sentences, we clean out some basic symbols. We then apply the CDQA library and find the POS tags of the sentences. Using the NER model and the CDQA library, we extract the number of days from each sentence (of each paper). After manual inspection, we identify abnormalities and extreme values and clean the data again. Once we have the number of days suggested by each paper, we multiply it by the paper's PageRank value, used as a weight, to get the weighted average.
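The final weighting step can be sketched as below. The function name `weighted_mean_days`, the choice to normalize by the sum of the weights, and the choice to skip papers without a PageRank value are assumptions of this example, and the numbers are made up rather than taken from our results.

```python
def weighted_mean_days(days_by_paper, rank_by_paper):
    """Weighted mean of per-paper day estimates, using PageRank values as weights."""
    weighted_sum = 0.0
    total_weight = 0.0
    for paper, days in days_by_paper.items():
        weight = rank_by_paper.get(paper)
        if weight is None:
            # Assumption of this sketch: papers without a PageRank value are skipped.
            continue
        weighted_sum += weight * days
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no paper has both a day estimate and a PageRank value")
    # Dividing by the total weight keeps the result in days.
    return weighted_sum / total_weight


# Made-up values for illustration only; they are not results from the project.
days = {"paper-1": 5.0, "paper-2": 14.0, "paper-3": 7.0}
ranks = {"paper-1": 0.5, "paper-2": 0.1, "paper-3": 0.4}
estimate = weighted_mean_days(days, ranks)
```

Here the outlying 14-day estimate comes from the lowest-ranked paper, so it pulls the weighted mean up less than it would pull a plain average.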
We randomly sampled 100 papers and noted their suggested incubation periods, and sampled another random 100 for the contagious period. The average incubation period we got was 8.5 days, which is very close to the 8.8 that we found, and the average contagious period came out to be 4.5 days, which is also close to the 5.2 that we found.
- The presentation directory holds our presentation slides in .ppt format and the report directory holds our final report in .pdf format.