/COVID19-Literature-Clustering

Let's cluster similar research articles together to make it easier for health professionals to find relevant research articles, and responde to rapidly spreading COVID-19 promptly.

Primary LanguageHTML

COVID-19 Literature Clustering

Goal

Given a large amount of literature and rapidly spreading COVID-19, it is difficult for a scientist to keep up with the research community promptly. Can we cluster similar research articles together to make it easier for health professionals to find relevant research articles? Clustering can be used to create a tool to identify similar articles, given a target article. It can also reduce the number of articles one has to go through as one can focus on a cluster of articles.

View the Interactive COVID-19 Literature Clustering Plot

https://maksimekin.github.io/COVID19-Literature-Clustering/plots/t-sne_covid-19_interactive.html

t-SNE Output Clustered For Visualization

Approach:

  1. Unsupervised Learning task, because we don't have labels for the articles
  2. Clustering and Dimensionality Reduction task
  3. See how well labels from K-Means classify
  4. Use N-Grams with Hash Vectorizer
  5. Use plain text with Tfidf
  6. Use K-Means for clustering
  7. Use t-SNE for dimensionality reduction
  8. Use PCA for dimensionality reduction
  9. There is no continuous flow of data, no need to adjust to changing data, and the data is small enough to fit in memmory: Batch Learning
  10. Altough, there is no continuous flow of data, our approach has to be scalable as there will be more literature later

Dataset Description

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

Maps

Maps generated using Novel Corona Virus 2019 Dataset | Kaggle.

How to Cite This Work?

@inproceedings{COVID-19 Literature Clustering,
	author = {Eren, E. Maksim. Solovyev, Nick. Nicholas, Charles. Raff, Edward},
	title = {COVID-19 Literature Clustering},
	year = {2020},
	month = {March-April},
	location = {University of Maryland Baltimore County (UMBC), Baltimore, MD, USA},
	note={Malware Research Group},
	howpublished = {\url{https://github.com/MaksimEkin/COVID19-Literature-Clustering}}
}

Citation/Sources

Kaggle Submission: COVID-19 Literature Clustering | Kaggle

@inproceedings{Raff2020,
   author = {Raff, Edward and Nicholas, Charles and McLean, Mark},
   booktitle = {The Thirty-Fourth AAAI Conference on Artificial Intelligence},
   title = {{A New Burrows Wheeler Transform Markov Distance}},
   url = {http://arxiv.org/abs/1912.13046},
   year = {2020},
}
@misc{Kaggle,
	author = {Kaggle},
	title = {COVID-19 Open Research Dataset Challenge (CORD-19)},
	year = {2020},
	month = {March},
	note = {Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for   Security and Emerging Technology, Microsoft Research, and the National Library of Medicine - National Institutes of Health, in coordination with The White House Office of Science and Technology Policy.},
	howpublished = {\url{https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge}}
}
@inproceedings{Shakespeare,
	author = {Nicholas, Charles},
	title = {Mr. Shakespeare, Meet Mr. Tucker},
	booktitle = {High Performance Computing and Data Analytics Workshop},
	year = {2019},
	month = {September},
	location = { Linthicum Heights, MD, USA},
}
@inproceedings{raff_lzjd_2017,
	author = {Raff, Edward and Nicholas, Charles},
	title = {An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance},
	booktitle = {Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
	series = {KDD '17},
	year = {2017},
	isbn = {978-1-4503-4887-4},
	location = {Halifax, NS, Canada},
	pages = {1007--1015},
	numpages = {9},
	url = {http://doi.acm.org/10.1145/3097983.3098111},
	doi = {10.1145/3097983.3098111},
	acmid = {3098111},
	publisher = {ACM},
	address = {New York, NY, USA},
	keywords = {cyber security, jaccard similarity, lempel-ziv, malware classification, normalized compression distance},
}