improvements / alternatives to clustering
elliottash commented
-- could also allow for custom initialized cluster centroids
-- allow for clustering based on cosine-similarity thresholds, measured either to the centroid or to the closest member of the cluster.
-- replace the Arora et al. embeddings with S-BERT embeddings
-- allow for stretching the space along an antonyms dimension
-- drop all names as stopwords
-- drop patients that contain a verb
-- make clustering on the list of entity phrases, rather than the set, an option. that is, add sample_weight=n_mentions to the k-means .fit() function. could also weight by log of n_mentions.
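A rough sketch of the custom-centroid and sample_weight ideas above, assuming the clustering step uses scikit-learn's KMeans (the issue only says "k-means"). The helper name fit_weighted_kmeans, its arguments, and the toy data are hypothetical, not part of the existing code. For the log option it uses log(1 + n_mentions) rather than a plain log, which is my assumption, so that phrases mentioned once do not end up with zero weight.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_weighted_kmeans(embeddings, n_mentions, init_centroids, log_weights=False):
    """Fit k-means on unique phrase embeddings, weighted by mention counts.

    Hypothetical helper: the names and the signature are illustrative only.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    counts = np.asarray(n_mentions, dtype=float)
    init_centroids = np.asarray(init_centroids, dtype=float)

    # Assumption: log1p instead of log, so a phrase with one mention keeps
    # a positive weight (log(1) would be 0).
    weights = np.log1p(counts) if log_weights else counts

    # Passing an array as `init` uses it as the starting centroids;
    # n_init=1 because there is only one initialization to try.
    km = KMeans(n_clusters=init_centroids.shape[0], init=init_centroids, n_init=1)
    km.fit(embeddings, sample_weight=weights)
    return km


# Toy 2-d "embeddings" for four unique phrases; the second one has 3 mentions.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
toy_mentions = [1, 3, 1, 1]
toy_init = [[0.0, 0.0], [10.0, 0.0]]

km_raw = fit_weighted_kmeans(toy_embeddings, toy_mentions, toy_init)
km_log = fit_weighted_kmeans(toy_embeddings, toy_mentions, toy_init, log_weights=True)
print(km_raw.cluster_centers_)
print(km_log.cluster_centers_)
```

With raw counts, the first centroid is pulled toward the frequently mentioned phrase, which is what clustering on the list rather than the set would do. The log variant dampens that pull.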
elliottash commented
another possible approach: https://towardsdatascience.com/clustering-sentence-embeddings-to-identify-intents-in-short-text-48d22d3bf02e