Ted Talks Clustering

Clustering over 4,000+ Ted Talks using the most common clustering algorithms, with a comparison between tf-idf and word embeddings. Data from: https://www.kaggle.com/datasets/miguelcorraljr/ted-ultimate-dataset
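The code snippets throughout this README are minimal, untested sketches of the pipeline, not the repository's exact notebook code. This first one loads the transcripts, assuming the English file `ted_talks_en.csv` from the Kaggle dataset with a `transcript` column:

```python
# Minimal data-loading sketch (assumes the English transcripts file
# "ted_talks_en.csv" from the Kaggle dataset, with a "transcript" column).
import pandas as pd

talks = pd.read_csv("ted_talks_en.csv")
transcripts = talks["transcript"].dropna().tolist()   # raw documents to cluster
print(f"Loaded {len(transcripts)} transcripts")
```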

  • Comparison of the clusters formed with tf-idf and with word embeddings, using the most common clustering algorithms: KMeans, Gaussian Mixture Models and Agglomerative Clustering (see the sketches after this list).
  • Tuning of the hyperparameters of all models.
  • Comparison of the results using multiple clustering metrics (Davies-Bouldin Index, Silhouette and Calinski-Harabasz), as in the sketch after this list.
  • Bonus experiment using only the most relevant tf-idf words, partly mitigating the curse of dimensionality (see the sketch after this list).
  • Bonus experiment using word embeddings from Microsoft MiniLM-L12-H384.
  • Final analysis using wordclouds and n-grams to identify the topics.
  • Insights into which algorithms and metrics work best for document clustering, and why.
  • Built with cuML, Spark (PySpark) and sentence-transformers.
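A sketch of the two document representations being compared: sparse tf-idf vectors and dense sentence embeddings. The vectorizer settings and the `all-MiniLM-L12-v2` checkpoint (a sentence-transformers model distilled from Microsoft's MiniLM-L12-H384) are assumptions; the notebooks may use different values.

```python
# Sketch: build both feature matrices from the transcripts loaded above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# Sparse tf-idf vectors (high-dimensional, one column per vocabulary term).
tfidf = TfidfVectorizer(stop_words="english", max_df=0.8, min_df=5)
X_tfidf = tfidf.fit_transform(transcripts)

# Dense 384-dimensional sentence embeddings from a MiniLM-based encoder.
encoder = SentenceTransformer("all-MiniLM-L12-v2")
X_embed = encoder.encode(transcripts, show_progress_bar=True)
```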
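A sketch of fitting the three algorithms on one feature matrix and scoring them with the three metrics, shown here with scikit-learn (the repository uses cuML, whose estimators expose a similar interface); the number of clusters `k` is a placeholder for the tuned value.

```python
# Sketch: fit KMeans, a Gaussian Mixture Model and Agglomerative Clustering,
# then compare them with DBI, Silhouette and Calinski-Harabasz.
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (davies_bouldin_score, silhouette_score,
                             calinski_harabasz_score)

X = X_embed     # dense embeddings; pass a (reduced) dense tf-idf matrix for the tf-idf run
k = 10          # placeholder number of clusters, tuned in practice

models = {
    "KMeans": KMeans(n_clusters=k, n_init=10, random_state=0),
    "Gaussian Mixture": GaussianMixture(n_components=k, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=k),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    print(f"{name:17s}"
          f" DBI={davies_bouldin_score(X, labels):.3f}"            # lower is better
          f" Silhouette={silhouette_score(X, labels):.3f}"          # higher is better
          f" Calinski={calinski_harabasz_score(X, labels):.1f}")    # higher is better
```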
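A sketch of the reduced-vocabulary bonus experiment: rank terms by their highest tf-idf score over all documents and keep only the top ones, shrinking the feature space. The ranking rule and vocabulary size are assumptions, not necessarily what the notebooks do.

```python
# Sketch: keep only the most relevant tf-idf terms to ease the curse of dimensionality.
import numpy as np

n_top_terms = 2000                                              # assumed reduced vocabulary size
max_scores = np.asarray(X_tfidf.max(axis=0).todense()).ravel()  # best tf-idf score per term
top_idx = np.argsort(max_scores)[::-1][:n_top_terms]            # indices of the strongest terms
X_tfidf_reduced = X_tfidf[:, top_idx]                           # documents restricted to those terms
```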

Wordclouds generated from KMeans with BERT embeddings:

Wordclouds generated from Gaussian Mixture Models with BERT embeddings:
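A minimal sketch of how such per-cluster wordclouds can be generated from the fitted labels (reusing `transcripts` and `labels` from the earlier sketches; not the repository's exact plotting code):

```python
# Sketch: one wordcloud per cluster, built from all transcripts assigned to it.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

for cluster_id in sorted(set(labels)):
    text = " ".join(t for t, label in zip(transcripts, labels) if label == cluster_id)
    wc = WordCloud(width=800, height=400, stopwords=STOPWORDS,
                   background_color="white").generate(text)
    plt.figure(figsize=(8, 4))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Cluster {cluster_id}")
    plt.show()
```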