-1 dbscan category
foongminwong opened this issue · 1 comments
foongminwong commented
Hi, I was trying to run dbscan on some texts and create a scatterplot.
I wonder why my dbscan_labels column has a -1 category (I'm not sure what it means):
documents['dbscan_labels'] = (
    documents['tfidf']
    .pipe(hero.dbscan)
    .astype(str)
)
hero.scatterplot(df=documents, col='pca', color='dbscan_labels', hover_data=['ID', 'Title'], title=" DBScan Clustering (Test) - Texthero library")
I previously ran k-means on the same data, and the clusters and scatter plot looked good:
documents['tfidf'] = (
    documents['Text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)
documents['kmeans_labels'] = (
    documents['tfidf']
    .pipe(hero.kmeans, n_clusters=13)
    .astype(str)
)
documents['pca'] = documents['tfidf'].pipe(hero.pca)
hero.scatterplot(df=documents, col='pca', color='kmeans_labels', hover_data=['ID', 'Title'], title="K-Means Clustering (Test) - Texthero library")
Thank you!
jbesomi commented
Hi @foongminwong, thank you for reaching out!
DBSCAN assigns each point either to a cluster or to a special "noise" class (outliers that do not lie in any dense region). The label -1 means the DBSCAN algorithm classified those points as noise; it is not a regular cluster.
We will need to update the docstring of the texthero.representation.dbscan function and make it more explicit. Would you like to help us with that?
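To make the -1 label concrete, here is a minimal pure-Python sketch of the DBSCAN idea. It is not Texthero's or scikit-learn's implementation: the function `dbscan`, the toy points, and the `eps` and `min_samples` values are all made up for illustration. It only shows why a point that has too few neighbors within `eps` ends up labeled -1.

```python
from math import dist

NOISE = -1  # label used for points that belong to no cluster


def dbscan(points, eps, min_samples):
    """Illustrative DBSCAN sketch: returns one label per point, -1 = noise."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # neighborhood too sparse to start a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # j is a core point, keep expanding
    return labels


# Hypothetical toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point `(50, 50)` has no neighbors within `eps`, so it never joins a cluster and keeps the label -1. scikit-learn's `DBSCAN` follows the same convention and gives noisy samples the label -1. If many documents end up as -1, a larger `eps` or a smaller `min_samples` usually reduces the noise class, assuming the function you call exposes those parameters.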