
Cluster the TED corpus, visualize the clusters as word clouds, and project them to two dimensions with t-SNE.


TedClustering

An attempt to explore the TED corpus with clustering algorithms. The data comes from https://www.kaggle.com/rounakbanik/ted-talks.

Features used: TF, TF-IDF, LSA
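As a rough illustration of the three feature types, the sketch below builds them with scikit-learn. The library choice, the toy documents, and parameters such as `n_components=3` are assumptions made for this example, not details taken from this repository.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical stand-in documents; the real input would be the talk transcripts.
docs = [
    "coral reefs are dying because the ocean is warming",
    "the ocean absorbs carbon and heat from the atmosphere",
    "nuclear energy can cut carbon emissions",
    "fear of nuclear energy slows climate action",
    "warming oceans threaten coral reefs and fish",
    "carbon emissions drive ocean warming",
]

# TF: raw term counts per document.
tf = CountVectorizer(stop_words="english").fit_transform(docs)

# TF-IDF: counts reweighted by inverse document frequency.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LSA: truncated SVD of the TF-IDF matrix, followed by row normalization.
lsa = make_pipeline(
    TruncatedSVD(n_components=3, random_state=0),
    Normalizer(copy=False),
)
lsa_features = lsa.fit_transform(tfidf)

print(tf.shape, tfidf.shape, lsa_features.shape)
```

TF and TF-IDF share the same sparse vocabulary-sized shape, while LSA yields a small dense matrix, which is usually easier for distance-based clustering to handle.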

Clustering algorithms: K-means, MiniBatchKMeans, hierarchical clustering, DBSCAN, and iForest (Isolation Forest, used for anomaly detection)
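The clustering algorithms listed above all have scikit-learn implementations. The following is a minimal sketch of running them side by side, assuming scikit-learn and using synthetic blobs in place of the document features; the cluster count, `eps`, and `min_samples` values are illustrative guesses rather than the settings used here.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the document feature matrix (assumption).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=4, n_init=10, random_state=0),
    "average link": AgglomerativeClustering(n_clusters=4, linkage="average"),
    "complete link": AgglomerativeClustering(n_clusters=4, linkage="complete"),
    "ward link": AgglomerativeClustering(n_clusters=4, linkage="ward"),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

# fit_predict returns one cluster id per sample for every model above.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, y in labels.items():
    # DBSCAN marks noise points with the label -1, which is not a cluster.
    n_clusters = len(set(y)) - (1 if -1 in y else 0)
    print(f"{name}: {n_clusters} clusters")
```

Note that DBSCAN chooses the number of clusters itself, so it is not directly comparable to the other methods at a fixed cluster count.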

| Algorithm | Entropy |
| --- | --- |
| MiniBatchKMeans | 5.06 |
| KMeans | 4.82 |
| Hierarchical clustering, average link | 5.28 |
| Hierarchical clustering, complete link | 5.17 |
| Hierarchical clustering, ward link | 4.85 |
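This README does not say how the entropy score is computed. One common definition, shown here purely as an assumption, is the size-weighted mean entropy of some reference labels (for example, talk tags) inside each cluster, where lower values mean purer clusters. The function name and the choice of reference labels are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids, reference_labels):
    """Size-weighted mean entropy (in bits) of reference labels within each cluster."""
    # Group the reference labels by the cluster each sample was assigned to.
    members = {}
    for cluster_id, label in zip(cluster_ids, reference_labels):
        members.setdefault(cluster_id, []).append(label)

    n_samples = len(cluster_ids)
    total = 0.0
    for labels in members.values():
        counts = Counter(labels)
        # Shannon entropy of the label distribution inside this cluster.
        h = -sum(
            (k / len(labels)) * math.log2(k / len(labels)) for k in counts.values()
        )
        # Weight each cluster by its share of all samples.
        total += (len(labels) / n_samples) * h
    return total


pure = cluster_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = cluster_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
print(pure, mixed)
```

Under this definition, a clustering that separates the reference labels perfectly scores 0, and one that mixes two labels evenly in every cluster scores 1 bit.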

The figure below shows how LSA affects the result.

[Figure: effect of LSA on the result]

Word clouds for clusters 0-9

[Figures: one word cloud per cluster, clusters 0-9]

t-SNE projection of clusters 0-9

[Figure: t-SNE projection of clusters 0-9]
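A projection like this can be produced with scikit-learn's `TSNE`. The sketch below is only an assumption about the approach: it embeds synthetic 50-dimensional blobs standing in for the LSA features, and the `perplexity` and `init` values are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for the feature matrix and its 10 cluster ids (assumption).
X, cluster_ids = make_blobs(
    n_samples=200, centers=10, n_features=50, random_state=0
)

# Reduce to two dimensions; perplexity must stay below the number of samples.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)
```

Plotting the two columns of `embedding` as a scatter plot, colored by `cluster_ids`, gives a picture of how well the clusters separate.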

Anomaly detection with iForest on TF-IDF + LSA features (the most distinctive TED talks)

| Score | Title |
| --- | --- |
| -0.039 | An 8-dimensional model of the universe |
| -0.038 | Debate: Does the world need nuclear energy? |
| -0.025 | Does democracy stifle economic growth? |
| -0.021 | Why bees are disappearing |
| -0.018 | How we're growing baby corals to rebuild reefs |
| -0.015 | Our refugee system is failing. Here's how we can fix it |
| -0.012 | The laws that sex workers really want |
| -0.009 | How fear of nuclear power is hurting the environment |
| -0.008 | The refugee crisis is a test of our character |
| -0.007 | Why I still have hope for coral reefs |
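A ranking like the one above can be obtained from scikit-learn's `IsolationForest`, whose `decision_function` returns negative scores for the points it considers most anomalous. The sketch below assumes that is how the scores were produced; the random data, the planted outlier, and the placeholder titles are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the TF-IDF + LSA features (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
X[0] += 8.0  # plant one obvious outlier so the ranking has a clear winner

# Hypothetical titles; the real ones would come from the dataset.
titles = [f"talk {i}" for i in range(len(X))]

forest = IsolationForest(n_estimators=200, random_state=0).fit(X)

# Lower scores mean more anomalous; negative scores are flagged as outliers.
scores = forest.decision_function(X)

# Sort ascending and keep the ten most anomalous samples.
ranked = sorted(zip(scores, titles))[:10]
for score, title in ranked:
    print(f"{score:.3f} {title}")
```

Sorting in ascending order puts the most distinctive items first, matching the layout of the table above.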