jbesomi/texthero

-1 dbscan category

foongminwong opened this issue · 1 comments

Hi, I was trying to run dbscan on some texts and create a scatterplot.

I wonder why my dbscan_labels has a -1 category (not sure what it means):

documents['dbscan_labels'] = (
    documents['tfidf']
    .pipe(hero.dbscan)
    .astype(str)
)

hero.scatterplot(df=documents, col='pca', color='dbscan_labels', hover_data=['ID', 'Title'], title="DBSCAN Clustering (Test) - Texthero library")


I tried running using k-means previously and the clusters/scatter plot look good:

documents['tfidf'] = (
    documents['Text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

documents['kmeans_labels'] = (
    documents['tfidf']
    .pipe(hero.kmeans, n_clusters=13)
    .astype(str)
)

documents['pca'] = documents['tfidf'].pipe(hero.pca)

hero.scatterplot(df=documents, col='pca', color='kmeans_labels', hover_data=['ID', 'Title'], title="K-Means Clustering (Test) - Texthero library")


Thank you!

Hi @foongminwong, thank you for reaching out!

DBSCAN classifies each point either as a member of a cluster or as a "noise point" (an outlier). The label -1 marks the points that the DBSCAN algorithm has classified as noise, i.e. points that do not belong to any cluster.
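To illustrate where the -1 label comes from, here is a minimal sketch that calls scikit-learn's DBSCAN directly (which, as far as I know, is what hero.dbscan builds on; treat that as an assumption). The toy points and the eps and min_samples values are made up for illustration and have nothing to do with your data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): two tight groups of points plus one
# isolated point that is far away from both groups.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # first group
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # second group
    [20.0, 20.0],                         # isolated point
])

# eps and min_samples are illustrative values chosen for this toy data.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# Points labelled -1 are noise; every other label is a cluster id.
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels.tolist()) - {-1})

print(labels.tolist())
print("clusters:", n_clusters, "noise points:", n_noise)
```

On the toy data the isolated point gets the label -1 while the two groups get ordinary cluster ids. If most of your documents end up in -1, the default eps is probably too small for your TF-IDF vectors, so it may be worth trying a larger eps or a smaller min_samples.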

We will need to update the docstring of the texthero.representation.dbscan function and make it more explicit. Would you like to help us with that?