TedClustering

An attempt to explore the TED talks corpus with clustering algorithms. The data comes from https://www.kaggle.com/rounakbanik/ted-talks.

Features used: TF (term frequency), TF-IDF, LSA (latent semantic analysis)

Clustering algorithms: K-means, mini-batch K-means (MiniBatchKMeans), hierarchical clustering, DBSCAN, Isolation Forest (iforest, used for anomaly detection)
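
As a rough illustration of how these features and algorithms fit together, here is a minimal sketch using scikit-learn. It is not the code from this repository: the toy documents, the number of LSA components, and the number of clusters are all assumptions chosen to keep the example small.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy stand-in documents (assumption); the real input would be TED transcripts.
docs = [
    "coral reefs are dying as the ocean warms",
    "growing baby corals to rebuild ocean reefs",
    "nuclear energy and the future of power plants",
    "fear of nuclear power hurts the energy debate",
    "refugee camps and the asylum system",
    "the refugee crisis and asylum policy",
]

# TF-IDF features: one sparse row per document.
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)

# LSA: truncated SVD on the TF-IDF matrix, then L2-normalise each row so that
# Euclidean K-means behaves like clustering by cosine similarity.
lsa = make_pipeline(
    TruncatedSVD(n_components=3, random_state=0),  # 3 components is an assumption
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(X_tfidf)

# K-means on the reduced features; n_clusters=3 is an assumption for the toy data.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X_lsa)
print(labels)
```

Swapping `KMeans` for `MiniBatchKMeans`, `AgglomerativeClustering`, or `DBSCAN` (all in `sklearn.cluster`) keeps the same `fit_predict` call, which makes it easy to compare algorithms on identical features.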

algorithm	entropy
MiniBatchKMeans	5.06
KMeans	4.82
hierarchical clustering average link	5.28
hierarchical clustering complete link	5.17
hierarchical clustering ward link	4.85
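
This README does not say how the entropy values were computed. One common definition, shown here only as an assumption, is the size-weighted average of the Shannon entropy of reference labels (for example, talk tags) inside each cluster. The function name `clustering_entropy` and the toy labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_ids, ref_labels):
    """Size-weighted mean of per-cluster Shannon entropy (base 2).

    Assumed definition, not necessarily the one used for the table above.
    """
    # Group the reference labels by the cluster each item was assigned to.
    groups = defaultdict(list)
    for c, y in zip(cluster_ids, ref_labels):
        groups[c].append(y)

    n = len(cluster_ids)
    total = 0.0
    for members in groups.values():
        counts = Counter(members)
        size = len(members)
        # Entropy of the label distribution inside this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        # Weight by the fraction of all items that fall in this cluster.
        total += (size / n) * h
    return total


# Toy example: cluster 0 is pure, cluster 1 is an even mix of two labels.
clusters = [0, 0, 1, 1]
tags = ["science", "science", "science", "politics"]
print(clustering_entropy(clusters, tags))
```

Under this definition, lower values mean purer clusters, and a clustering in which every cluster contains a single label scores 0.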

The table below shows how LSA affects the result:

score	title
-0.039	An 8-dimensional model of the universe
-0.038	Debate: Does the world need nuclear energy?
-0.025	Does democracy stifle economic growth?
-0.021	Why bees are disappearing
-0.018	How we're growing baby corals to rebuild reefs
-0.015	Our refugee system is failing. Here's how we can fix it
-0.012	The laws that sex workers really want
-0.009	How fear of nuclear power is hurting the environment
-0.008	The refugee crisis is a test of our character
-0.007	Why I still have hope for coral reefs