# Topic Modeling TED Talks
## Project Goals
- Explore natural language processing techniques by creating a topic model using TED talk transcripts
- Develop an application for topic modeling by creating a simple TED talk recommender
- Develop an interactive front end to showcase the data, exploratory data analysis, topic modeling, and recommender
## Project Overview
- The final presentation is hosted on Google Slides here.
- The final deliverable was an interactive app built with Streamlit and deployed on Heroku. The app lets you explore the data, the exploratory data analysis, and the algorithms, and use the recommender.
- You can also run the app locally with `streamlit run interface.py`.
## Part I: Topic Modeling and Natural Language Processing
TED talks are currently categorized under hundreds of topics. In fact, on the TED website itself, a wide range of topics is listed here, from niche topics like "biomimicry" to general ideas like "big problems." This project began by using natural language processing and unsupervised learning to create a smaller set of topics with which to categorize TED talks.
## Part II: Recommendation System
Next, I created and deployed a simple recommendation system that uses the topics modeled above to recommend TED talks based on Jensen-Shannon divergence. You can input a title or an index, and the system generates the topic distribution, summary, link, and TED's existing tags for that talk, along with the 5 most similar and 5 most dissimilar talks.
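The recommendation step can be sketched as follows. This is a minimal illustration, not the project's actual code: `recommend`, `doc_topics`, and `titles` are hypothetical names standing in for the document-topic matrix produced by the topic model and the talk titles.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def recommend(index, doc_topics, titles, k=5):
    """Return the k most similar and k most dissimilar talks to talk `index`."""
    # Jensen-Shannon distance between the query talk's topic distribution
    # and every other talk's topic distribution
    dists = np.array([jensenshannon(doc_topics[index], row) for row in doc_topics])
    order = np.argsort(dists)  # closest first; order[0] is the talk itself
    similar = [titles[i] for i in order[1:k + 1]]
    dissimilar = [titles[i] for i in order[::-1][:k]]
    return similar, dissimilar
```

Jensen-Shannon divergence is a natural fit here because each row of the document-topic matrix is a probability distribution over topics.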
## Part III: Interactive Front End
Finally, I used Streamlit to develop an interactive interface for viewing the exploratory data analysis and the deliverables of the project, then deployed the app online with Heroku. The sidebar lets the user explore the algorithms used, browse the exploratory data analysis, and customize some visualizations. The user can also generate talk topic distributions and TED talk recommendations from a talk title or a random index.
## Data & Methods
- Information about 4,200+ TED talks was scraped from the official TED website. The dataset includes only the TED talks available from the quicklist of all TED talks here.
- Topic modeling focused on the 3,600+ TED talks with transcripts
- Created a custom tokenizer (see tokenizer.py) using the NLTK and SpaCy packages:
  - Added missing spaces after punctuation
  - Removed parenthetical phrases about "applause," "laughter," and "music"
  - Handled numbers in hyphenated words and numbers containing commas
  - Removed punctuation and stop words, and lower-cased all tokens
  - Looked exclusively at unigrams
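The tokenizer steps above can be sketched roughly as below. This is a simplified illustration, not the actual tokenizer.py (which uses NLTK and SpaCy); the stop-word list here is a tiny illustrative stand-in.

```python
import re
import string

# Illustrative stop-word list; the real tokenizer uses NLTK/SpaCy stop words
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def tokenize(transcript):
    # Remove parenthetical stage directions like "(Applause)" or "(Laughter)"
    text = re.sub(r"\((?:applause|laughter|music)[^)]*\)", " ",
                  transcript, flags=re.IGNORECASE)
    # Add a missing space after sentence-ending punctuation ("word.Next" -> "word. Next")
    text = re.sub(r"([.!?])([A-Z])", r"\1 \2", text)
    # Lower-case, strip punctuation, keep unigrams, drop stop words
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```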
- After applying the tokenizer to the 3,600+ TED talks, there were 2,300,000+ tokens (52,000+ unique tokens).
- Used a CountVectorizer to exclude tokens that occurred in more than 70% of the transcripts or in fewer than 4 transcripts
  - 15,000+ unique tokens (features) used in the final model
- After vectorizing the tokenized transcripts, fed the data into Latent Dirichlet Allocation (LDA)
- Tested 10, 13, 15, 16, 17, 20, and 25 topics
- Evaluated using an 80:20 train-test split, log-likelihood, perplexity, and human readability
- Although a high log-likelihood and low perplexity are what many aim for, these metrics have at times been found to produce topics that run counter to human readability. For this project, I focused on human readability.
Number of Topics | Train Log-Likelihood | Test Log-Likelihood | Train Perplexity | Test Perplexity |
---|---|---|---|---|
10 | -1.26e+07 | -3.27e+06 | 2700.88 | 3764.29 |
15 | -1.26e+07 | -3.28e+06 | 2722.52 | 3853.54 |
20 | -1.26e+07 | -3.28e+06 | 2694.63 | 3895.46 |
25 | -1.26e+07 | -3.29e+06 | 2716.94 | 3940.12 |
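The metrics in the table can be computed with scikit-learn's LDA implementation. The sketch below uses a small random count matrix as a stand-in for the real document-term matrix; names are illustrative:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Toy stand-in for the document-term matrix from the CountVectorizer
rng = np.random.default_rng(0)
doc_term = rng.integers(0, 5, size=(40, 30))  # 40 "documents" x 30 "tokens"

train, test = train_test_split(doc_term, test_size=0.2, random_state=42)
for n_topics in (10, 15, 20, 25):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(train)
    # score() is an approximate log-likelihood; perplexity() is derived from it
    print(n_topics, lda.score(train), lda.score(test),
          lda.perplexity(train), lda.perplexity(test))
```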
- Final model used 15 topics
- The following table shows the label I assigned to each topic based on its most salient words, along with the 5 most salient words associated with each topic
Assigned Topic Name | #1 Word | #2 Word | #3 Word | #4 Word | #5 Word |
---|---|---|---|---|---|
General | life | tell | brain | human | talk |
Science | light | energy | earth | planet | space |
Technology | cell | brain | human | technology | system |
Politics | political | government | power | medium | war |
Problems | country | percent | change | company | problem |
Personal | day | tell | story | life | love |
AI | robot | machine | computer | build | game |
Miscellaneous | socket | tk | amputate | prosthesis | amputee |
Healthcare | patient | cancer | health | disease | doctor |
Linguistics/Healthcare | language | word | book | write | read |
Space | satellite | rocket | orbit | space | launch |
Agriculture/Nature | food | eat | plant | farmer | animal |
Gender/Sexuality | woman | man | sex | female | male |
Audio/Visual | music | sound | play | voice | hear |
Urban Planning/Design | city | design | building | build | place |
- Transformed the entire data set using the LDA model that was fit on the training data
  - Dominant Topic: largest topic in a talk
  - Secondary Topic: second largest topic in a talk
  - Tertiary Topic: third largest topic in a talk
- Used the resulting document-topic matrix as the input to calculate Jensen-Shannon divergence, which takes in probability distributions, to calculate similarity between TED talks
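Extracting the dominant, secondary, and tertiary topics amounts to sorting each row of the document-topic matrix by weight. The toy matrix below stands in for the output of the LDA transform:

```python
import numpy as np

# Toy document-topic matrix (each row is a talk's topic distribution),
# standing in for the result of transforming the full data set with LDA
doc_topics = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3]])

ranked = np.argsort(doc_topics, axis=1)[:, ::-1]  # topic indices, largest weight first
dominant, secondary, tertiary = ranked[:, 0], ranked[:, 1], ranked[:, 2]
```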
## Deliverables & Insights
- TED talks are usually framed around solving a personal problem
- TED talks are meant to appeal to a general audience
- Although TED has been around for several decades, it developed a strong, widely known image around the early 2000s
- Tokenizing and processing transcripts is error-prone
- Can get TED talk recommendations by using a dropdown to search for titles
- The app will then show the topic distribution, summary, assigned TED tags, and a link to watch the talk, as displayed in the GIFs
- The app will also show the 5 most similar and 5 most dissimilar talks, along with their summaries, tags, and links
## Future Work
- Word Embeddings & n-grams
- Recommend based on topic or keywords
- Applications of topic modeling
- Semantic analysis
- TED talks over time
- Predicting views or audience engagement
- Improved frontend