/tv-show-summarization-twitter

Topic modeling and timeseries visualization for summarization of live TV shows via Twitter

Primary language: Jupyter Notebook

Latent Dirichlet allocation and timeseries analysis for summarization of live TV shows via Twitter


Watching TV is usually accompanied by comments about the content, which we tend to address to nearby people or online friends. In this empirical study, we retrieved 30k Twitter status updates posted during popular TV talk shows. Topic modeling lets us separate themes and, eventually, summarize long TV broadcasts automatically.

[Figure: Tweet volume change during the talk show Enikos (18/4/2016)]
[Figure: Top LDA topics during the talk show Anatropi (12/4/2016)]

We provide the source code for our analysis, the PDF report, and a Jupyter Notebook with both input and output. You can also watch the presentation here.


Algorithm walkthrough

  1. Read the tweets retrieved with tweepy for a hashtag (JSON/TXT).
  2. Load them into a pandas DataFrame and keep only the relevant columns (tweet text and timestamp in our case).
  3. (Optional) Convert timestamps to the local timezone (Athens time in our case; consult pytz to adjust).
  4. Count tweet volume per minute.
  5. Plot time series.
  6. (Optional) Remove stopwords and find the most frequent tokens (Greek stopwords in our case, but any NLTK-supported language works).
  7. LDA preprocessing: remove words occurring in only one document or in at least 95% of the documents.
  8. Transform the data into a document-term matrix.
  9. Obtain the words with high probabilities.
  10. Obtain the feature names.
  11. Print the LDA topics and assign topics to each tweet.

Dependencies

  • Python 2.7+
  • Scikit-learn
  • Pandas
  • Numpy
  • Vincent
  • NLTK
  • LDA