relatio-nlp/relatio

improvements / alternatives to clustering

-- allow custom-initialized cluster centroids
-- allow clustering based on cosine-similarity thresholds, measured either to the centroid or to the closest member of the cluster
-- replace the Arora et al. embeddings with S-BERT (Sentence-BERT) embeddings
-- allow stretching the embedding space along an antonym dimension
-- treat all names as stopwords and drop them
-- drop patients that contain a verb
-- make clustering on the list of entity phrases (with repeats), rather than the set of unique phrases, an option. That is, pass sample_weight=n_mentions to the k-means .fit() function. Could also weight by the log of n_mentions.
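
A rough sketch of how three of the items above (custom initial centroids, mention-count weighting, and a cosine-similarity threshold to the centroid) might look with scikit-learn. This is not relatio's current API: the toy `embeddings`, the `n_mentions` counts, the seed centroids, the 0.8 threshold, and the `assign_with_threshold` helper are all made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Toy 2-D "embeddings" for six unique entity phrases (hypothetical data).
embeddings = np.array([
    [1.0, 0.1], [0.9, 0.0], [1.0, -0.1],   # phrases near direction (1, 0)
    [0.0, 1.0], [0.1, 0.9], [-0.1, 1.0],   # phrases near direction (0, 1)
])
# How often each unique phrase occurs in the corpus (hypothetical counts).
n_mentions = np.array([100, 1, 1, 50, 2, 1])

# Custom initial centroids: one seed vector per desired cluster.
init_centroids = np.array([[1.0, 0.0], [0.0, 1.0]])

# n_init=1 because the starting centroids are given explicitly.
kmeans = KMeans(n_clusters=2, init=init_centroids, n_init=1)
# Weight each unique phrase by log(1 + n_mentions) instead of repeating it.
kmeans.fit(embeddings, sample_weight=np.log1p(n_mentions))
labels = kmeans.labels_


def assign_with_threshold(vectors, centroids, threshold):
    """Assign each vector to its most similar centroid, or -1 if below threshold."""
    sims = cosine_similarity(vectors, centroids)
    best = sims.argmax(axis=1)
    # Leave phrases unassigned when even the best centroid is too dissimilar.
    best[sims.max(axis=1) < threshold] = -1
    return best


# Two unseen phrase vectors: one close to cluster 0, one far from both clusters.
new_phrases = np.array([[0.95, 0.05], [-1.0, -1.0]])
thresholded = assign_with_threshold(new_phrases, kmeans.cluster_centers_, threshold=0.8)
```

Passing `sample_weight=n_mentions` directly would correspond to clustering on the full list of mentions; the log version just dampens very frequent phrases. Thresholding against the closest cluster member instead of the centroid would need the per-cluster member vectors rather than `cluster_centers_`.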