Understanding COVID-19 from Research Articles using Text Analytics in MATLAB

The COVID-19 Open Research Dataset (CORD-19), offered by the Allen Institute for AI and other leading research groups, is a collection of thousands of articles on COVID-19 and related coronaviruses. Here, we use text analytics techniques in MATLAB to explore the articles, applying topic modeling and document summarization to answer some of the relevant questions.


Goal: Explore relevant articles to understand “what do we know about transmission?”

Data used: comm_use_subset from the dataset hosted at COVID-19 Open Research Dataset (CORD-19)

Techniques used: Topic modeling and Document Summarization

MATLAB Live Script: TopicModel_Transmission_comm_use.mlx

Wordcloud of Titles


Step 1: First, we use latent Dirichlet allocation (LDA) to discover the underlying topics in the articles. We test four different solvers:

  • ‘cgs’: collapsed Gibbs sampling
  • ‘avb’: approximate variational Bayes
  • ‘cvb0’: collapsed variational Bayes, zeroth order
  • ‘savb’: stochastic approximate variational Bayes

Validation Perplexity for Solvers
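
The solver comparison in Step 1 can be sketched roughly as follows. This is a minimal illustration, not the code from the Live Script: it assumes Text Analytics Toolbox is installed, uses a small synthetic corpus in place of the CORD-19 abstracts, and fixes the number of topics at an arbitrary value. The variable names (pools, bagTrain, documentsValidation) are hypothetical.

```matlab
% Synthetic stand-in for the preprocessed abstracts (assumption, not CORD-19 data).
rng(0);
pools = {["virus" "transmission" "droplet" "airborne" "contact" "spread"], ...
         ["vaccine" "antibody" "immune" "protein" "trial" "dose"]};
numDocs = 200;
textData = strings(numDocs, 1);
for i = 1:numDocs
    pool = pools{mod(i, 2) + 1};
    textData(i) = join(pool(randi(numel(pool), 1, 8)));
end
documents = tokenizedDocument(textData);

% Randomly hold out a quarter of the documents for validation.
idx = randperm(numDocs);
documentsValidation = documents(idx(1:50));
bagTrain = bagOfWords(documents(idx(51:end)));

% Fit one LDA model per solver with a fixed, arbitrary number of topics.
solvers = ["cgs" "avb" "cvb0" "savb"];
numTopics = 2;
validationPerplexity = zeros(size(solvers));
for s = 1:numel(solvers)
    mdl = fitlda(bagTrain, numTopics, 'Solver', solvers(s), 'Verbose', 0);
    % The second output of logp is the perplexity on the supplied documents.
    [~, validationPerplexity(s)] = logp(mdl, documentsValidation);
end

disp(table(solvers', validationPerplexity', ...
    'VariableNames', {'Solver', 'ValidationPerplexity'}))
```

A lower validation perplexity suggests a better fit to held-out documents. On a real corpus, fitting time per solver is usually worth recording alongside it.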

Step 2: After choosing a solver, we choose the number of topics by comparing validation perplexity across several candidate values; lower perplexity on held-out documents indicates a better fit.

Validation Perplexity for Number of Topics
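
A sweep over the number of topics, as described in Step 2, might look like the sketch below. It again assumes Text Analytics Toolbox and a synthetic corpus. The candidate values in numTopicsRange and the choice of the 'cvb0' solver are placeholders, not the values used in the Live Script.

```matlab
% Synthetic stand-in for the preprocessed abstracts (assumption, not CORD-19 data).
rng(0);
pools = {["virus" "transmission" "droplet" "airborne" "contact" "spread"], ...
         ["vaccine" "antibody" "immune" "protein" "trial" "dose"]};
numDocs = 200;
textData = strings(numDocs, 1);
for i = 1:numDocs
    pool = pools{mod(i, 2) + 1};
    textData(i) = join(pool(randi(numel(pool), 1, 8)));
end
documents = tokenizedDocument(textData);

% Randomly hold out a quarter of the documents for validation.
idx = randperm(numDocs);
documentsValidation = documents(idx(1:50));
bagTrain = bagOfWords(documents(idx(51:end)));

% Fit one model per candidate topic count and record its validation perplexity.
numTopicsRange = [2 4 6 8];            % placeholder candidates
validationPerplexity = zeros(size(numTopicsRange));
for k = 1:numel(numTopicsRange)
    mdl = fitlda(bagTrain, numTopicsRange(k), 'Solver', 'cvb0', 'Verbose', 0);
    [~, validationPerplexity(k)] = logp(mdl, documentsValidation);
end

% Pick the candidate with the lowest validation perplexity.
[~, best] = min(validationPerplexity);
bestNumTopics = numTopicsRange(best);
fprintf('Lowest validation perplexity at %d topics\n', bestNumTopics);
```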

Step 3: Build the final model using the chosen solver and optimum number of topics.

Final Model Sample Topics
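
Fitting the final model and listing sample topics, as in Step 3, could be sketched as follows. The solver ('cgs') and the number of topics (2) are placeholders standing in for whatever Steps 1 and 2 select, and the corpus is synthetic.

```matlab
% Synthetic stand-in for the preprocessed abstracts (assumption, not CORD-19 data).
rng(0);
pools = {["virus" "transmission" "droplet" "airborne" "contact" "spread"], ...
         ["vaccine" "antibody" "immune" "protein" "trial" "dose"]};
numDocs = 200;
textData = strings(numDocs, 1);
for i = 1:numDocs
    pool = pools{mod(i, 2) + 1};
    textData(i) = join(pool(randi(numel(pool), 1, 8)));
end
bag = bagOfWords(tokenizedDocument(textData));

% Final model with the chosen solver and number of topics (placeholder values).
mdl = fitlda(bag, 2, 'Solver', 'cgs', 'Verbose', 0);

% Print the five highest-scoring words of each topic.
for t = 1:mdl.NumTopics
    top = topkwords(mdl, 5, t);
    fprintf('Topic %d: %s\n', t, join(top.Word, ", "));
end
```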

Step 4: To answer the question “what do we know about transmission?”, we select the most relevant article in two stages: first we identify the topic in which the word “transmission” has the highest probability, then we identify the document with the highest probability for that topic.
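
One way to express this lookup uses the TopicWordProbabilities and DocumentTopicProbabilities properties of the fitted LDA model. The sketch below assumes Text Analytics Toolbox and a synthetic corpus. It reads "the topic with the word having highest probability" as the topic that assigns the largest probability to "transmission", which is an interpretation rather than something the original text confirms.

```matlab
% Synthetic stand-in for the preprocessed abstracts (assumption, not CORD-19 data).
rng(0);
pools = {["virus" "transmission" "droplet" "airborne" "contact" "spread"], ...
         ["vaccine" "antibody" "immune" "protein" "trial" "dose"]};
numDocs = 200;
textData = strings(numDocs, 1);
for i = 1:numDocs
    pool = pools{mod(i, 2) + 1};
    textData(i) = join(pool(randi(numel(pool), 1, 8)));
end
bag = bagOfWords(tokenizedDocument(textData));
mdl = fitlda(bag, 2, 'Verbose', 0);    % placeholder number of topics

% Topic that gives the query word its highest probability.
% TopicWordProbabilities is vocabulary-by-topics.
wordIdx = find(mdl.Vocabulary == "transmission", 1);
[~, topicIdx] = max(mdl.TopicWordProbabilities(wordIdx, :));

% Document with the largest share of that topic.
% DocumentTopicProbabilities is documents-by-topics.
[~, docIdx] = max(mdl.DocumentTopicProbabilities(:, topicIdx));

fprintf('Most relevant document (#%d): %s\n', docIdx, textData(docIdx));
```

To rank several articles instead of one, sort the chosen column of DocumentTopicProbabilities in descending order and take the first few indices.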

Step 5: An alternative approach is to summarize the top abstracts.
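
Extractive summarization of an abstract can be sketched with extractSummary, which is available in Text Analytics Toolbox from R2020a. Whether the Live Script uses this exact function is an assumption. The abstract below is illustrative placeholder text, not taken from CORD-19, and the summary size of two sentences is arbitrary.

```matlab
% Illustrative placeholder abstract (assumption, not taken from CORD-19).
abstract = "Respiratory viruses spread mainly through droplets and close contact. " + ...
    "Airborne transmission may occur in poorly ventilated indoor spaces. " + ...
    "Contaminated surfaces can also contribute to the spread of infection. " + ...
    "Hand hygiene and masks reduce the risk of transmission. " + ...
    "Further studies are needed to quantify each route.";

% extractSummary treats each tokenized document as one sentence,
% so split the abstract into sentences first.
sentences = splitSentences(abstract);
documents = tokenizedDocument(sentences);

% Keep the two highest-ranked sentences as the summary.
summary = extractSummary(documents, 'SummarySize', 2);
disp(joinWords(summary))
```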

Next Steps

There are many ways to dig deeper into this single question. Some of the possible approaches are:


The aim of this example is to

  • show how to use text analytics techniques to explore text data and build predictive models, and
  • provide a starting point for researchers to build on and dive deeper into unanswered questions regarding this pandemic.

For more information on Text Analytics using MATLAB:


