Colab notebook crashing while calculating PCA/K-Means. CSV file contains 80,000+ rows!
Hello,
I'm trying to visualize K-Means clusters for a dataset of 80K+ rows and 9 columns.
The notebook keeps crashing whenever I run this particular code:
import texthero as hero

# Add PCA values to the dataframe to use as visualization coordinates
df1['pca'] = (
    df1['clean_tweet']
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)

# Add k-means cluster labels to the dataframe
df1['kmeans'] = (
    df1['clean_tweet']
    .pipe(hero.tfidf)
    .pipe(hero.kmeans)
)

df1.head()
Is it because texthero can't handle that many rows yet? Is there any other solution?
The same thing happened to me, and I assumed it was because of the combination of large data and Colab.
Hi!
This is a known (current) limitation of Texthero. It will be fixed in upcoming releases (Texthero is still in Beta).
The problem arises in the tfidf step: by default max_features is None, which means a giant document-term matrix is created. That matrix is sparse by default, but as of now Texthero converts it into a dense matrix (so it can be stored as a Pandas Series of lists and passed into pca).
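To get a sense of the scale, here is a rough back-of-envelope estimate of the dense matrix size (the vocabulary size of 100,000 terms is an assumed figure for 80K+ tweets, not something measured on this dataset):

# Rough size estimate for the dense document-term matrix
n_docs = 80_000           # rows in the CSV
n_terms = 100_000         # assumed vocabulary size when max_features=None
bytes_per_value = 8       # float64
dense_gb = n_docs * n_terms * bytes_per_value / 1e9
print(f"~{dense_gb:.0f} GB")   # ~64 GB, far beyond the RAM of a standard Colab runtime

With max_features=300 the same matrix is only 80,000 × 300 × 8 bytes ≈ 0.2 GB, which fits comfortably in memory.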
For now, you should be able to solve the problem by replacing ".pipe(hero.tfidf)" with ".pipe(hero.tfidf, max_features=300)" (any value between 100 and 300 is fine).
Let me know if that works. In future releases we will develop a different solution that returns a Sparse Pandas Series, see #43.
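For reference, here is a minimal sketch of the workaround applied to the snippet above (assuming texthero is imported as hero and the cleaned tweets are in df1['clean_tweet'], as in the original code):

import texthero as hero

# Cap the tf-idf vocabulary so the resulting dense matrix stays small
df1['pca'] = (
    df1['clean_tweet']
    .pipe(hero.tfidf, max_features=300)
    .pipe(hero.pca)
)

df1['kmeans'] = (
    df1['clean_tweet']
    .pipe(hero.tfidf, max_features=300)
    .pipe(hero.kmeans)
)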
[UPDATE]
I tried setting max_features=300 and it worked for 80k+ tweets!
This is a workaround for now.