coronavirusupdate

R data package of the German NDR Coronavirus-Update podcast

When the NDR Coronavirus-Update announced that it would make transcripts of all episodes publicly available, I first tried to scrape the PDF files, which kind of worked but was not very reliable. So I went back and now use the transcripts directly from the podcast’s homepage.
Consequently I decided to create an R package to make it easy for anyone to scrape and access the data.
Every Wednesday a GitHub Action in this repo is triggered to fetch the newest transcripts from NDR.de (as long as the website structure doesn’t change).
Non-R users can also find the data as a Parquet file in data/coronavirusupdate_transcripts.parquet.
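
If you only need the Parquet file, a minimal sketch for reading it in R (assuming the arrow package is installed; any other Parquet reader works as well) could look like this:

# read the Parquet file directly, without installing this package
# (assumes the {arrow} package is available)
library(arrow)
transcripts <- read_parquet("data/coronavirusupdate_transcripts.parquet")
head(transcripts)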

Installation

You can install this package from GitHub with:

# install.packages("devtools")
devtools::install_github("quickcoffee/coronavirusupdate")

Example

This is a basic example of how to load the transcript data:

library(coronavirusupdate)
## load data
data("coronavirusupdate_transcripts")

This will load a tidy data frame called coronavirusupdate_transcripts with one row per paragraph. Currently this includes 114 episodes with the following structure:

library(dplyr)
glimpse(coronavirusupdate_transcripts)
#> Rows: 10,295
#> Columns: 8
#> $ title            <chr> "Mit BA.4 und BA.5 in den Sommer", "Mit BA.4 und BA.5…
#> $ link             <chr> "https://www.ndr.de/nachrichten/info/114-Coronavirus-…
#> $ episode_no       <int> 114, 114, 114, 114, 114, 114, 114, 114, 114, 114, 114…
#> $ speaker          <chr> "Beke Schulmann", "Sandra Ciesek", "Sandra Ciesek", "…
#> $ text             <chr> "Die Zahl der Neuinfektionen steigt wieder nach einer…
#> $ paragraph_no     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
#> $ last_change      <dttm> 2022-06-17 17:00:00, 2022-06-17 17:00:00, 2022-06-17…
#> $ duration_episode <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
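
As a quick usage sketch based on the columns shown above, one could filter the data to a single speaker and episode, for example all paragraphs spoken by Sandra Ciesek in episode 114:

# all paragraphs by Sandra Ciesek in episode 114
coronavirusupdate_transcripts %>%
  filter(speaker == "Sandra Ciesek", episode_no == 114) %>%
  select(paragraph_no, text)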

With this dataset one could, for example, inspect each speaker's share of each episode:

library(ggplot2)

coronavirusupdate_transcripts %>%
  mutate(paragraph_length = nchar(text)) %>% 
  group_by(episode_no, speaker) %>% 
  summarise(speaker_length = sum(paragraph_length)) %>% 
  group_by(episode_no) %>% 
  mutate(speaker_share = speaker_length/sum(speaker_length)) %>% 
  ggplot(aes(x = episode_no, y = speaker_share, fill = speaker)) +
  geom_bar(position = "stack", stat = "identity") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "Share of speaker per episode",
       y = "Share",
       x = "Episode No",
       fill = NULL)

Or one can run a tf-idf analysis to find out which words are “unique” to each speaker:

library(tidytext) #for text processing and analysis
library(ggplot2) # for plotting

# create tibble with stopwords from stopwords package and some very frequent words in the corpus
stop_words_de <- tibble(word = c(stopwords::stopwords("de"),
                                 "dass",
                                 "schon",
                                 "mal",
                                 "ja",
                                 "ganz",
                                 "gibt",
                                 "sagen"))

# get top 6 speaker by paragraph count
top_speaker <- coronavirusupdate_transcripts %>% 
  count(speaker, sort = TRUE) %>% 
  head(6) %>% 
  pull(speaker)

tidy_corona <- coronavirusupdate_transcripts %>%
  filter(speaker %in% top_speaker) %>%
  select(episode_no, speaker, text) %>%
  unnest_tokens(word, text) %>% 
  anti_join(stop_words_de)

# get tf-idf per speaker
tidy_corona_tf_idf <- tidy_corona %>%
  count(speaker, word) %>% 
  ungroup() %>% 
  bind_tf_idf(word, speaker, n)


tidy_corona_tf_idf %>% 
  group_by(speaker) %>% 
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>% 
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  ggplot(aes(tf_idf, word, fill = speaker)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~speaker, ncol = 2, scales = "free") +
  labs(title = "Highest tf-idf words of top 6 speakers from the NDR Coronavirusupdate",
       caption = "Transcripts from NDR.de",
       x = "tf-idf", y = NULL)+
  theme_minimal()