TypeError: unhashable type: 'Int64Index'
zhijcao opened this issue · 8 comments
Hi Ben,
Thank you for your great work!
I generated topic models with five different numbers of topics on the same corpus and dictionary. I can use pyLDAvis to visualize
four of them, but one raises an error. Could you please help me with it? I get this error with both the new and the old version of pyLDAvis.
Best,
Zhijun
ERROR information
TypeError Traceback (most recent call last)
in
----> 1 vismallet = gensimvis.prepare(models[3], corpus, dictionary=id2word, sort_topics=False)
~\AppData\Roaming\Python\Python38\site-packages\pyLDAvis\gensim_models.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
121 """
122 opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
--> 123 return pyLDAvis.prepare(**opts)
~\AppData\Roaming\Python\Python38\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics, start_index)
437 term_frequency = np.sum(term_topic_freq, axis=0)
438
--> 439 topic_info = _topic_info(topic_term_dists, topic_proportion,
440 term_frequency, term_topic_freq, vocab, lambda_step, R,
441 n_jobs, start_index)
~\AppData\Roaming\Python\Python38\site-packages\pyLDAvis\_prepare.py in _topic_info(topic_term_dists, topic_proportion, term_frequency, term_topic_freq, vocab, lambda_step, R, n_jobs, start_index)
278 for ls in _job_chunks(lambda_seq, n_jobs)))
279 topic_dfs = map(topic_top_term_df, enumerate(top_terms.T.iterrows(), start_index))
--> 280 return pd.concat([default_term_info] + list(topic_dfs))
281
282
~\AppData\Roaming\Python\Python38\site-packages\pyLDAvis\_prepare.py in topic_top_term_df(tup)
262 def topic_top_term_df(tup):
263 new_topic_id, (original_topic_id, topic_terms) = tup
--> 264 term_ix = topic_terms.unique()
265 df = pd.DataFrame({'Term': vocab[term_ix],
266 'Freq': term_topic_freq.loc[original_topic_id, term_ix],
~\AppData\Roaming\Python\Python38\site-packages\pandas\core\series.py in unique(self)
1870 Categories (3, object): ['a' < 'b' < 'c']
1871 """
-> 1872 result = super().unique()
1873 return result
1874
~\AppData\Roaming\Python\Python38\site-packages\pandas\core\base.py in unique(self)
1045 result = np.asarray(result)
1046 else:
-> 1047 result = unique1d(values)
1048
1049 return result
~\AppData\Roaming\Python\Python38\site-packages\pandas\core\algorithms.py in unique(values)
405
406 table = htable(len(values))
--> 407 uniques = table.unique(values)
408 uniques = _reconstruct_data(uniques, original.dtype, original)
409 return uniques
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.unique()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()
~\AppData\Roaming\Python\Python38\site-packages\pandas\core\indexes\base.py in __hash__(self)
4271 @final
4272 def __hash__(self):
-> 4273 raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
4274
4275 @final
TypeError: unhashable type: 'Int64Index'
I had the same error and solved it by reducing the number of topics.
What version are you using? I suggest using version 3.3.1 and upgrading all pip
packages to the latest available versions.
FWIW, I ran into a similar issue with Python 3.8 and version 3.3.1 in a situation where the original K of my model is greater than the resulting number of clusters. I've been driving myself insane trying to find a workaround, as my underlying data is pretty noisy: if I reduce K to avoid empty clusters, a lot of junk ends up spread around instead of being dumped into just a few clusters. I wish I could just go without the viz, but our communications team finds it really useful as a top-line scan of recent Twitter chatter.
I tried implementing the following, borrowing from here, but that just got me to a different error (TypeError: Object of type 'complex' is not JSON serializable), which I worked around by amending utils.py per here. I then hit yet another error (AttributeError: module 'numpy' has no attribute 'Int64Index'), which went away after reloading my virtual environment, though now the clusters are all even more tightly packed. So long story short... I'm stumped!
# attempted workaround for empty clusters
import math

import pandas as pd
import pyLDAvis


def prepare_data(mgp):
    # vocab, docs and K are defined elsewhere in my script
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)
    doc_lengths = [len(doc) for doc in docs]
    # count how often each term occurs across all documents
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    # replace NaN scores and all-zero rows with a uniform distribution
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum([occurrence for word, occurrence in cluster.items()])
        assert not math.isnan(total)
        # assert total > 0
        if total == 0:
            # empty cluster: fall back to a uniform term distribution
            row = [(1 / len(vocabulary))] * len(vocabulary)
        else:
            row = [cluster.get(term, 1) / total for term in vocabulary]  # <--- Modified this to be (term, 1) instead of (term, 0)
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts


def prepare_visualization_data(mgp):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    return vis_data


vis_data = prepare_visualization_data(mgp)
pyLDAvis.save_html(vis_data, 'sttmChart.html')
Happy to post/share my full code if it's helpful. Thanks!
I'm having the same error. Reducing the number of topics to fewer than 10 solves the issue, but this is definitely not optimal.
NB: I'm using LdaMallet, converted by means of "gensim.models.wrappers.ldamallet.malletmodel2ldamodel".
I am also running into this issue. The following are the steps to reproduce it. Happy to provide more details if necessary. I am using pyLDAvis 3.3.1.
I used the following 5 lines as documents to train a topic model for 5 topics.
I ate dinner
We had a three course meal
In the end we all felt like we ate too much
We all agreed it was a magnificent evening
He loves fish tacos
import gensim
from gensim import corpora
import pyLDAvis.gensim_models

# mallet_path must point to a local MALLET executable
data = [['I', 'ate', 'dinner'], ['We', 'had', 'a', 'three', 'course', 'meal'], ['In', 'the', 'end', 'we', 'all', 'felt', 'like', 'we', 'ate', 'too', 'much'], ['We', 'all', 'agreed', 'it', 'was', 'a', 'magnificent', 'evening'], ['He', 'loves', 'fish', 'tacos']]
id2word = corpora.Dictionary(data)
texts = data
corpus = [id2word.doc2bow(text1) for text1 in texts]
lda_mallet_model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=5, id2word=id2word, random_seed=41)  # I am using gensim version 3.8.3
gensim_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda_mallet_model)
pyldaVis_prepared_model = pyLDAvis.gensim_models.prepare(gensim_model, corpus, id2word)  # this line raises the error
The error is:
File "/usr/local/lib/python3.9/site-packages/pyLDAvis/gensim_models.py", line 123, in prepare
return pyLDAvis.prepare(**opts)
File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 439, in prepare
topic_info = _topic_info(topic_term_dists, topic_proportion,
File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 280, in _topic_info
return pd.concat([default_term_info] + list(topic_dfs))
File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 264, in topic_top_term_df
term_ix = topic_terms.unique()
File "/usr/local/lib/python3.9/site-packages/pandas/core/series.py", line 1872, in unique
result = super().unique()
File "/usr/local/lib/python3.9/site-packages/pandas/core/base.py", line 1047, in unique
result = unique1d(values)
File "/usr/local/lib/python3.9/site-packages/pandas/core/algorithms.py", line 407, in unique
uniques = table.unique(values)
File "pandas/_libs/hashtable_class_helper.pxi", line 4719, in pandas._libs.hashtable.PyObjectHashTable.unique
File "pandas/_libs/hashtable_class_helper.pxi", line 4666, in pandas._libs.hashtable.PyObjectHashTable._unique
File "/usr/local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4273, in __hash__
raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
TypeError: unhashable type: 'Int64Index'
I can reproduce the error now. Work in progress, to be continued.
import pyLDAvis
from pyLDAvis import gensim_models
import gensim
data = [['I', 'ate', 'dinner'], ['We', 'had', 'a', 'three', 'course', 'meal'], ['In', 'the', 'end', 'we', 'all', 'felt', 'like', 'we', 'ate', 'too', 'much'], ['We', 'all', 'agreed', 'it', 'was', 'a', 'magnificent', 'evening'], ['He', 'loves', 'fish', 'tacos']]
id2word = gensim.corpora.Dictionary(data)
texts = data
corpus = [id2word.doc2bow(text1) for text1 in texts]
mallet_path = '/Users/msusol/DATASCIENCE/mallet-2.0.8/bin/mallet'
lda_mallet_model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=5, id2word=id2word, random_seed = 41)
gensim_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda_mallet_model)
pyldaVis_prepared_model = gensim_models.prepare(gensim_model, corpus, id2word)  # this line raises the error
--> 282 return pd.concat([default_term_info] + list(topic_dfs))
ipdb> pd.concat([default_term_info] + list(topic_dfs))
*** TypeError: unhashable type: 'Int64Index'
For a working example (pyLDAvis_overview.ipynb), we get
ipdb> top_terms
topic 18 0 10 16 13 14 12 9 4 6 19 15 \
0 30 2821 1712 3493 2423 657 985 3841 1227 3297 814 759
1 43 3987 3474 3903 3067 660 3556 4048 1878 3499 2691 3346
2 52 4065 6450 4036 3398 650 4346 4131 2659 3710 3364 4675
3 73 4314 7379 4527 3675 3201 4395 4852 2745 4360 3920 4765
4 77 4364 7703 5610 3849 3725 4955 5248 2892 4497 1146 5889
.. .. ... ... ... ... ... ... ... ... ... ... ...
25 34 427 36 297 255 413 87 1977 1878 1154 971 651
26 32 14 236 566 905 5 873 195 977 212 45 1239
27 40 189 8 186 35 1170 378 212 1309 2056 2078 118
28 43 967 74 498 244 1431 392 1922 128 35 222 901
29 9 271 253 119 94 706 755 296 1868 713 2086 331
topic 17 8 2 7 1 11 3 5
0 1973 904 2436 1749 2224 3553 2287 1910
1 2489 1562 2666 2636 2309 3665 2377 2190
2 3516 1905 2818 3596 2813 3736 2577 3412
3 3966 2024 2834 4352 3740 4122 3612 4984
4 3977 824 4223 4842 4917 4445 3875 5010
.. ... ... ... ... ... ... ... ...
25 286 1830 2436 728 2846 231 114 719
26 923 1446 27 1300 220 1485 2621 2990
27 25 11 962 2644 2501 23 214 813
28 1253 173 156 45 74 1247 178 101
29 140 2375 87 70 242 1514 244 1100
[3030 rows x 20 columns]
ipdb> pd.concat([default_term_info] + list(topic_dfs))
Term Freq Total Category logprob loglift
1 movie 5510.000000 5510.000000 Default 30.0000 30.0000
0 film 8913.000000 8913.000000 Default 29.0000 29.0000
2 good 2399.000000 2399.000000 Default 28.0000 28.0000
7 character 1922.000000 1922.000000 Default 27.0000 27.0000
4 story 2130.000000 2130.000000 Default 26.0000 26.0000
... ... ... ... ... ... ...
813 mike 24.886436 119.708520 Topic20 -5.4332 2.9934
1100 fbi 23.930001 93.158293 Topic20 -5.4724 3.2049
305 fans 25.842871 240.489458 Topic20 -5.3955 2.3335
25 big 27.755741 1055.640041 Topic20 -5.3241 0.9257
101 horror 23.930001 460.164805 Topic20 -5.4724 1.6077
[1496 rows x 6 columns]
Your model produces
ipdb> top_terms
0 1 2 3 4
2 Int64Index([2, 7, 11, 12, 18, 25, 1], dtype='i... NaN NaN NaN NaN
0 Int64Index([1], dtype='int64', name='term') NaN NaN NaN NaN
4 Int64Index([19, 24], dtype='int64', name='term') NaN NaN NaN NaN
3 Int64Index([21], dtype='int64', name='term') NaN NaN NaN NaN
1 Int64Index([26], dtype='int64', name='term') NaN NaN NaN NaN
.. ... .. .. .. ..
2 Int64Index([1, 2, 7, 11, 12, 18, 25], dtype='i... NaN NaN NaN NaN
0 Int64Index([1], dtype='int64', name='term') NaN NaN NaN NaN
4 Int64Index([19, 24], dtype='int64', name='term') NaN NaN NaN NaN
3 Int64Index([21], dtype='int64', name='term') NaN NaN NaN NaN
1 Int64Index([26], dtype='int64', name='term') NaN NaN NaN NaN
[1099 rows x 5 columns]
ipdb> pd.concat([default_term_info] + list(topic_dfs))
*** TypeError: unhashable type: 'Int64Index'
Hi Mark, I was wondering - did you manage to find some time to look into the above? Many thanks & best regards, Mike
Hi, I believe the problem is that the word_topics array in lda_mallet_model contains some elements equal to 0.
On lines 258-259 of _prepare.py,
log_lift = np.log(pd.eval("topic_term_dists / term_proportion")).astype("float64")
log_ttd = np.log(pd.eval("topic_term_dists")).astype("float64")
when pyLDAvis calculates log_lift and log_ttd, some of the elements in topic_term_dists are 0, and np.log(0) gives -inf.
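To check whether a given model is affected, one can count the exact zeros in its topic-term matrix before calling prepare. This is only a minimal sketch, assuming the matrix is available as a pandas DataFrame; the values below are hypothetical and merely stand in for a real topic_term_dists.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-topic x 4-term matrix standing in for topic_term_dists.
# Topic 0 assigns exactly zero probability to the last two terms.
topic_term_dists = pd.DataFrame(
    [[0.7, 0.3, 0.0, 0.0],
     [0.1, 0.2, 0.3, 0.4]]
)

# Number of exact zeros per topic (one row per topic).
zeros_per_topic = (topic_term_dists == 0).sum(axis=1)

# Taking the log turns every zero into -inf; silence the divide warning.
with np.errstate(divide="ignore"):
    log_ttd = np.log(topic_term_dists)

# Count the -inf entries that result from those zeros.
neg_inf_per_topic = np.isneginf(log_ttd.to_numpy()).sum(axis=1)

print(zeros_per_topic.tolist())
print(neg_inf_per_topic.tolist())
```

If either count is non-zero for any topic, the model can run into the problem described in this comment.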
Then, on lines 217-219 of _prepare.py,
def _find_relevance(log_ttd, log_lift, R, lambda_):
    relevance = lambda_ * log_ttd + (1 - lambda_) * log_lift
    return relevance.T.apply(lambda topic: topic.nlargest(R).index)
when it calculates relevance for the different lambda_ values (0, 0.01, 0.02, ..., 1), the cases lambda_=0 and lambda_=1 become problematic because 0 * -inf gives nan. By default, pandas will ignore the nan values when it's finding the nlargest terms in a given topic. As a result, it returns series of different lengths, which is why you get top_terms like this
ipdb> top_terms
0 1 2 3 4
2 Int64Index([2, 7, 11, 12, 18, 25, 1], dtype='i... NaN NaN NaN NaN
0 Int64Index([1], dtype='int64', name='term') NaN NaN NaN NaN
4 Int64Index([19, 24], dtype='int64', name='term') NaN NaN NaN NaN
3 Int64Index([21], dtype='int64', name='term') NaN NaN NaN NaN
1 Int64Index([26], dtype='int64', name='term') NaN NaN NaN NaN
.. ... .. .. .. ..
2 Int64Index([1, 2, 7, 11, 12, 18, 25], dtype='i... NaN NaN NaN NaN
0 Int64Index([1], dtype='int64', name='term') NaN NaN NaN NaN
4 Int64Index([19, 24], dtype='int64', name='term') NaN NaN NaN NaN
3 Int64Index([21], dtype='int64', name='term') NaN NaN NaN NaN
1 Int64Index([26], dtype='int64', name='term') NaN NaN NaN NaN
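The mechanism can be reproduced in isolation. The following is a rough sketch rather than the pyLDAvis code itself: the matrix, term_proportion and R are hypothetical values, and dropna() is called explicitly to model the NaN-skipping described above, so that the result does not depend on how the installed pandas version treats NaN inside nlargest.

```python
import numpy as np
import pandas as pd

R = 3  # number of top terms requested per topic (hypothetical value)

# Hypothetical inputs: topic 0 has two zero-probability terms.
topic_term_dists = pd.DataFrame(
    [[0.7, 0.3, 0.0, 0.0],
     [0.1, 0.2, 0.3, 0.4]]
)
term_proportion = pd.Series([0.4, 0.25, 0.15, 0.2])

# Zeros become -inf in both log matrices.
with np.errstate(divide="ignore"):
    log_ttd = np.log(topic_term_dists)
    log_lift = np.log(topic_term_dists / term_proportion)

# At lambda_ = 0 the first term is 0 * -inf = nan, and nan + (-inf) stays nan.
lambda_ = 0.0
with np.errstate(invalid="ignore"):
    relevance = lambda_ * log_ttd + (1 - lambda_) * log_lift

nan_per_topic = relevance.isna().sum(axis=1).tolist()

# Skipping NaN leaves topic 0 with fewer than R candidate terms.
top_terms = [
    relevance.loc[topic].dropna().nlargest(R).index
    for topic in relevance.index
]
lengths = [len(ix) for ix in top_terms]

print(nan_per_topic, lengths)
```

Topic 0 ends up with two terms while topic 1 gets the requested three, which is the ragged shape that shows up as Int64Index cells in top_terms.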
Finally, when topic_terms.unique() is called on line 264, each element is an Int64Index rather than a term id, and since an Index is not hashable, it raises the error
File "/usr/local/lib/python3.9/site-packages/pyLDAvis/gensim_models.py", line 123, in prepare
return pyLDAvis.prepare(**opts)
File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 439, in prepare
topic_info = _topic_info(topic_term_dists, topic_proportion,
File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 280, in _topic_info
return pd.concat([default_term_info] + list(topic_dfs))
File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 264, in topic_top_term_df
term_ix = topic_terms.unique()
File "/usr/local/lib/python3.9/site-packages/pandas/core/series.py", line 1872, in unique
result = super().unique()
File "/usr/local/lib/python3.9/site-packages/pandas/core/base.py", line 1047, in unique
result = unique1d(values)
File "/usr/local/lib/python3.9/site-packages/pandas/core/algorithms.py", line 407, in unique
uniques = table.unique(values)
File "pandas/_libs/hashtable_class_helper.pxi", line 4719, in pandas._libs.hashtable.PyObjectHashTable.unique
File "pandas/_libs/hashtable_class_helper.pxi", line 4666, in pandas._libs.hashtable.PyObjectHashTable._unique
File "/usr/local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4273, in __hash__
raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
TypeError: unhashable type: 'Int64Index'
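The last frame of the traceback can be checked on its own: pandas Index objects refuse to be hashed, and the traceback shows unique() going through a hash table. A minimal sketch, with made-up term ids:

```python
import pandas as pd

# An Index like the ones that ended up as cells of top_terms.
term_index = pd.Index([19, 24], name="term")

message = None
try:
    hash(term_index)  # Index.__hash__ always raises
except TypeError as exc:
    message = str(exc)

print(message)
```

The class name in the message depends on the pandas version (the tracebacks in this thread show 'Int64Index').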
Given that this problem is not specific to LDA Mallet (any model that has a 0 in topic_term_dists will have the same problem), I think the easiest fix is to replace the zeros in topic_term_dists with values close to, but not exactly, 0, such as 1e-10, to avoid getting -inf in the first place.
I suggest modifying lines 258-259 of _prepare.py to:
# to avoid -inf when calculating log_lift and log_ttd
topic_term_dists_non_zero = topic_term_dists.replace(0,1e-10)
log_lift = np.log(pd.eval("topic_term_dists_non_zero / term_proportion")).astype("float64")
log_ttd = np.log(pd.eval("topic_term_dists_non_zero")).astype("float64")
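As a quick sanity check of this suggestion, here is a small sketch on hypothetical data (the matrix, term_proportion and R are made up, and plain DataFrame arithmetic is used instead of pd.eval). It only illustrates that the logs stay finite after the replacement and that every topic then yields R terms; it is not a test of pyLDAvis itself.

```python
import numpy as np
import pandas as pd

R = 3  # number of top terms requested per topic (hypothetical value)

# Hypothetical inputs: topic 0 has two zero-probability terms.
topic_term_dists = pd.DataFrame(
    [[0.7, 0.3, 0.0, 0.0],
     [0.1, 0.2, 0.3, 0.4]]
)
term_proportion = pd.Series([0.4, 0.25, 0.15, 0.2])

# Replace exact zeros only in the copy used for the logarithms.
topic_term_dists_non_zero = topic_term_dists.replace(0, 1e-10)

log_lift = np.log(topic_term_dists_non_zero / term_proportion)
log_ttd = np.log(topic_term_dists_non_zero)

all_finite = bool(
    np.isfinite(log_ttd.to_numpy()).all()
    and np.isfinite(log_lift.to_numpy()).all()
)

# With finite logs, relevance has no NaN at any lambda_, so each topic
# returns exactly R terms, including the edge cases 0 and 1.
lengths_by_lambda = {}
for lambda_ in (0.0, 0.5, 1.0):
    relevance = lambda_ * log_ttd + (1 - lambda_) * log_lift
    lengths_by_lambda[lambda_] = [
        len(relevance.loc[topic].nlargest(R).index)
        for topic in relevance.index
    ]

print(all_finite, lengths_by_lambda)
```

Because the original topic_term_dists is left untouched, the rows still sum to 1 wherever the unmodified matrix is used.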