bmabey/pyLDAvis

TypeError: unhashable type: 'Int64Index'

zhijcao opened this issue · 8 comments

Hi Ben,
Thank you for your great work!
I generated topic models with 5 different numbers of topics on the same corpus and dictionary. I can use pyLDAvis to visualize
four of them, but one raises an error. Could you please help me with it? I get this error on both the new and old versions of pyLDAvis.
Best,
Zhijun

ERROR information

TypeError Traceback (most recent call last)
in
----> 1 vismallet = gensimvis.prepare(models[3], corpus, dictionary=id2word, sort_topics=False)

~\AppData\Roaming\Python\Python38\site-packages\pyLDAvis\gensim_models.py in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
121 """
122 opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
--> 123 return pyLDAvis.prepare(**opts)

~\AppData\Roaming\Python\Python38\site-packages\pyLDAvis\_prepare.py in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics, start_index)
437 term_frequency = np.sum(term_topic_freq, axis=0)
438
--> 439 topic_info = _topic_info(topic_term_dists, topic_proportion,
440 term_frequency, term_topic_freq, vocab, lambda_step, R,
441 n_jobs, start_index)

~\AppData\Roaming\Python\Python38\site-packages\pyLDAvis\_prepare.py in _topic_info(topic_term_dists, topic_proportion, term_frequency, term_topic_freq, vocab, lambda_step, R, n_jobs, start_index)
278 for ls in _job_chunks(lambda_seq, n_jobs)))
279 topic_dfs = map(topic_top_term_df, enumerate(top_terms.T.iterrows(), start_index))
--> 280 return pd.concat([default_term_info] + list(topic_dfs))
281
282

~\AppData\Roaming\Python\Python38\site-packages\pyLDAvis\_prepare.py in topic_top_term_df(tup)
262 def topic_top_term_df(tup):
263 new_topic_id, (original_topic_id, topic_terms) = tup
--> 264 term_ix = topic_terms.unique()
265 df = pd.DataFrame({'Term': vocab[term_ix],
266 'Freq': term_topic_freq.loc[original_topic_id, term_ix],

~\AppData\Roaming\Python\Python38\site-packages\pandas\core\series.py in unique(self)
1870 Categories (3, object): ['a' < 'b' < 'c']
1871 """
-> 1872 result = super().unique()
1873 return result
1874

~\AppData\Roaming\Python\Python38\site-packages\pandas\core\base.py in unique(self)
1045 result = np.asarray(result)
1046 else:
-> 1047 result = unique1d(values)
1048
1049 return result

~\AppData\Roaming\Python\Python38\site-packages\pandas\core\algorithms.py in unique(values)
405
406 table = htable(len(values))
--> 407 uniques = table.unique(values)
408 uniques = _reconstruct_data(uniques, original.dtype, original)
409 return uniques

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.unique()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable._unique()

~\AppData\Roaming\Python\Python38\site-packages\pandas\core\indexes\base.py in __hash__(self)
4271 @final
4272 def __hash__(self):
-> 4273 raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
4274
4275 @final

TypeError: unhashable type: 'Int64Index'

KDYao commented

I had the same error and solved it by reducing the number of topics.

What version are you using? I suggest using version 3.3.1 and upgrading all pip packages to the latest versions possible.

FWIW, I ran into a similar issue with Python 3.8 and version 3.3.1 in a situation where the original K of my model is greater than the resulting number of clusters. I've been driving myself insane trying to find a workaround, as my underlying data is pretty noisy, so if I reduce K to avoid empty clusters, a lot of junk ends up being spread around instead of being dumped into just a few clusters. I wish I could just go without the viz, but our communications team finds it really useful as a top-line scan of recent Twitter chatter.

I tried implementing the following, borrowing from here, but that just got me to a different error (TypeError: Object of type 'complex' is not JSON serializable), which I worked around by amending utils.py per here. I then hit yet another error (AttributeError: module 'numpy' has no attribute 'Int64Index'), but the workaround did work after I reloaded my virtual environment, though now the clusters are all even more tightly packed.

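# Attempted workaround for empty clusters.
# Assumes `mgp` (the fitted model), `docs` (tokenized documents), `vocab` and
# `K` (the number of clusters) are already defined at module level.
import math

import pandas as pd
import pyLDAvis
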
def prepare_data(mgp):
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)

    doc_lengths = [len(doc) for doc in docs]
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)

    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0

    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum(occurrence for _, occurrence in cluster.items())
        assert not math.isnan(total)
        # assert total > 0
        if total == 0:
            # Empty cluster: fall back to a uniform distribution over the vocabulary.
            row = [(1 / len(vocabulary))] * len(vocabulary)
        else:
            row = [cluster.get(term, 1) / total for term in vocabulary]  # <--- Modified this to be (term, 1) instead of (term, 0)
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    return vis_data

vis_data = prepare_visualization_data(mgp)
pyLDAvis.save_html(vis_data, 'sttmChart.html')

Happy to post/share my full code if it's helpful. Thanks!

I'm having the same error. Reducing the number of topics to fewer than 10 solves the issue, but this is definitely not optimal.
NB: I'm using LdaMallet, converted by means of "gensim.models.wrappers.ldamallet.malletmodel2ldamodel".

I am also running into this issue. Following are the steps to reproduce it. Happy to provide more details if necessary. I am using pyLDAvis 3.3.1

I used the following 5 lines as documents to train a topic model for 5 topics.

I ate dinner
We had a three course meal
In the end we all felt like we ate too much
We all agreed it was a magnificent evening
He loves fish tacos
import gensim  # I am using gensim version 3.8.3
import pyLDAvis.gensim_models
from gensim import corpora

data = [['I', 'ate', 'dinner'], ['We', 'had', 'a', 'three', 'course', 'meal'], ['In', 'the', 'end', 'we', 'all', 'felt', 'like', 'we', 'ate', 'too', 'much'], ['We', 'all', 'agreed', 'it', 'was', 'a', 'magnificent', 'evening'], ['He', 'loves', 'fish', 'tacos']]

# mallet_path must point to a local MALLET executable (not shown here).
id2word = corpora.Dictionary(data)
texts = data
corpus = [id2word.doc2bow(text1) for text1 in texts]
lda_mallet_model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=5, id2word=id2word, random_seed=41)
gensim_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda_mallet_model)
pyldaVis_prepared_model = pyLDAvis.gensim_models.prepare(gensim_model, corpus, id2word)  # this line gives the error

The error is:

  File "/usr/local/lib/python3.9/site-packages/pyLDAvis/gensim_models.py", line 123, in prepare
    return pyLDAvis.prepare(**opts)
  File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 439, in prepare
    topic_info = _topic_info(topic_term_dists, topic_proportion,
  File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 280, in _topic_info
    return pd.concat([default_term_info] + list(topic_dfs))
  File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 264, in topic_top_term_df
    term_ix = topic_terms.unique()
  File "/usr/local/lib/python3.9/site-packages/pandas/core/series.py", line 1872, in unique
    result = super().unique()
  File "/usr/local/lib/python3.9/site-packages/pandas/core/base.py", line 1047, in unique
    result = unique1d(values)
  File "/usr/local/lib/python3.9/site-packages/pandas/core/algorithms.py", line 407, in unique
    uniques = table.unique(values)
  File "pandas/_libs/hashtable_class_helper.pxi", line 4719, in pandas._libs.hashtable.PyObjectHashTable.unique
  File "pandas/_libs/hashtable_class_helper.pxi", line 4666, in pandas._libs.hashtable.PyObjectHashTable._unique
  File "/usr/local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4273, in __hash__
    raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
TypeError: unhashable type: 'Int64Index'

I can reproduce the error now. Work in progress, to be continued.

import pyLDAvis
from pyLDAvis import gensim_models
import gensim

data = [['I', 'ate', 'dinner'], ['We', 'had', 'a', 'three', 'course', 'meal'], ['In', 'the', 'end', 'we', 'all', 'felt', 'like', 'we', 'ate', 'too', 'much'], ['We', 'all', 'agreed', 'it', 'was', 'a', 'magnificent', 'evening'], ['He', 'loves', 'fish', 'tacos']]

id2word = gensim.corpora.Dictionary(data)
texts = data
corpus = [id2word.doc2bow(text1) for text1 in texts]
mallet_path = '/Users/msusol/DATASCIENCE/mallet-2.0.8/bin/mallet'
lda_mallet_model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=5, id2word=id2word, random_seed = 41)
gensim_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda_mallet_model)
pyldaVis_prepared_model = gensim_models.prepare(gensim_model, corpus, id2word)  # this line gives the error

--> 282     return pd.concat([default_term_info] + list(topic_dfs))

ipdb> pd.concat([default_term_info] + list(topic_dfs))
*** TypeError: unhashable type: 'Int64Index'

For a working example (pyLDAvis_overview.ipynb), we get

ipdb> top_terms
topic  18    0     10    16    13    14    12    9     4     6     19    15  \
0      30  2821  1712  3493  2423   657   985  3841  1227  3297   814   759   
1      43  3987  3474  3903  3067   660  3556  4048  1878  3499  2691  3346   
2      52  4065  6450  4036  3398   650  4346  4131  2659  3710  3364  4675   
3      73  4314  7379  4527  3675  3201  4395  4852  2745  4360  3920  4765   
4      77  4364  7703  5610  3849  3725  4955  5248  2892  4497  1146  5889   
..     ..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   
25     34   427    36   297   255   413    87  1977  1878  1154   971   651   
26     32    14   236   566   905     5   873   195   977   212    45  1239   
27     40   189     8   186    35  1170   378   212  1309  2056  2078   118   
28     43   967    74   498   244  1431   392  1922   128    35   222   901   
29      9   271   253   119    94   706   755   296  1868   713  2086   331   

topic    17    8     2     7     1     11    3     5   
0      1973   904  2436  1749  2224  3553  2287  1910  
1      2489  1562  2666  2636  2309  3665  2377  2190  
2      3516  1905  2818  3596  2813  3736  2577  3412  
3      3966  2024  2834  4352  3740  4122  3612  4984  
4      3977   824  4223  4842  4917  4445  3875  5010  
..      ...   ...   ...   ...   ...   ...   ...   ...  
25      286  1830  2436   728  2846   231   114   719  
26      923  1446    27  1300   220  1485  2621  2990  
27       25    11   962  2644  2501    23   214   813  
28     1253   173   156    45    74  1247   178   101  
29      140  2375    87    70   242  1514   244  1100  

[3030 rows x 20 columns]

ipdb> pd.concat([default_term_info] + list(topic_dfs))

           Term         Freq        Total Category  logprob  loglift
1         movie  5510.000000  5510.000000  Default  30.0000  30.0000
0          film  8913.000000  8913.000000  Default  29.0000  29.0000
2          good  2399.000000  2399.000000  Default  28.0000  28.0000
7     character  1922.000000  1922.000000  Default  27.0000  27.0000
4         story  2130.000000  2130.000000  Default  26.0000  26.0000
...         ...          ...          ...      ...      ...      ...
813        mike    24.886436   119.708520  Topic20  -5.4332   2.9934
1100        fbi    23.930001    93.158293  Topic20  -5.4724   3.2049
305        fans    25.842871   240.489458  Topic20  -5.3955   2.3335
25          big    27.755741  1055.640041  Topic20  -5.3241   0.9257
101      horror    23.930001   460.164805  Topic20  -5.4724   1.6077

[1496 rows x 6 columns]

Your model produces

ipdb> top_terms
                                                    0   1   2   3   4
2   Int64Index([2, 7, 11, 12, 18, 25, 1], dtype='i... NaN NaN NaN NaN
0         Int64Index([1], dtype='int64', name='term') NaN NaN NaN NaN
4    Int64Index([19, 24], dtype='int64', name='term') NaN NaN NaN NaN
3        Int64Index([21], dtype='int64', name='term') NaN NaN NaN NaN
1        Int64Index([26], dtype='int64', name='term') NaN NaN NaN NaN
..                                                ...  ..  ..  ..  ..
2   Int64Index([1, 2, 7, 11, 12, 18, 25], dtype='i... NaN NaN NaN NaN
0         Int64Index([1], dtype='int64', name='term') NaN NaN NaN NaN
4    Int64Index([19, 24], dtype='int64', name='term') NaN NaN NaN NaN
3        Int64Index([21], dtype='int64', name='term') NaN NaN NaN NaN
1        Int64Index([26], dtype='int64', name='term') NaN NaN NaN NaN

[1099 rows x 5 columns]

ipdb> pd.concat([default_term_info] + list(topic_dfs))
*** TypeError: unhashable type: 'Int64Index'

Hi Mark, I was wondering - did you manage to find some time to look into the above? Many thanks & best regards, Mike

Hi, I believe the problem is that the word_topics array in lda_mallet_model contains some elements equal to 0.

On lines 258-259 of _prepare.py,

log_lift = np.log(pd.eval("topic_term_dists / term_proportion")).astype("float64")
log_ttd = np.log(pd.eval("topic_term_dists")).astype("float64")

when pyLDAvis calculates log_lift and log_ttd, some of the elements in topic_term_dists are 0, and np.log(0) gives -inf.

Then, on lines 217-219 of _prepare.py,
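The first step can be checked in isolation. This is only a minimal sketch on a made-up 2-topic, 3-term matrix (the numbers are illustrative, not taken from the model above); it shows that a single zero probability is enough to put -inf into the log matrix.

```python
import numpy as np
import pandas as pd

# Toy topic-term matrix (2 topics x 3 terms); the values are made up.
# Topic 1 assigns zero probability to term 2.
topic_term_dists = pd.DataFrame([[0.5, 0.3, 0.2],
                                 [0.6, 0.4, 0.0]])

# np.log(0) evaluates to -inf; errstate only silences NumPy's
# divide-by-zero RuntimeWarning, it does not change the result.
with np.errstate(divide="ignore"):
    log_ttd = np.log(topic_term_dists)

has_neg_inf = bool(np.isneginf(log_ttd.to_numpy()).any())
print(log_ttd)
print("contains -inf:", has_neg_inf)
```

A check like `np.isneginf(...).any()` on your own topic-term matrix is a quick way to tell whether a model is likely to be affected before calling prepare.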

def _find_relevance(log_ttd, log_lift, R, lambda_):
    relevance = lambda_ * log_ttd + (1 - lambda_) * log_lift
    return relevance.T.apply(lambda topic: topic.nlargest(R).index)

When it calculates relevance for the different lambda_ values (0, 0.01, 0.02, ..., 1), lambda_=0 and lambda_=1 become problematic because 0 * -inf gives nan. By default, pandas ignores the nan values when finding the nlargest terms in a given topic. As a result, it returns series of different lengths, which is why you get top_terms like this

ipdb> top_terms
                                                    0   1   2   3   4
2   Int64Index([2, 7, 11, 12, 18, 25, 1], dtype='i... NaN NaN NaN NaN
0         Int64Index([1], dtype='int64', name='term') NaN NaN NaN NaN
4    Int64Index([19, 24], dtype='int64', name='term') NaN NaN NaN NaN
3        Int64Index([21], dtype='int64', name='term') NaN NaN NaN NaN
1        Int64Index([26], dtype='int64', name='term') NaN NaN NaN NaN
..                                                ...  ..  ..  ..  ..
2   Int64Index([1, 2, 7, 11, 12, 18, 25], dtype='i... NaN NaN NaN NaN
0         Int64Index([1], dtype='int64', name='term') NaN NaN NaN NaN
4    Int64Index([19, 24], dtype='int64', name='term') NaN NaN NaN NaN
3        Int64Index([21], dtype='int64', name='term') NaN NaN NaN NaN
1        Int64Index([26], dtype='int64', name='term') NaN NaN NaN NaN
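The mechanism described above can be sketched on a toy matrix. The values below are made up, and the explicit dropna() stands in for the NaN-skipping attributed to nlargest in the comment, since the exact way nlargest treats NaN may differ between pandas versions.

```python
import numpy as np
import pandas as pd

R = 3  # number of top terms requested per topic

# Toy log topic-term matrix (2 topics x 3 terms); topic 1 has a zero entry.
with np.errstate(divide="ignore"):
    log_ttd = np.log(pd.DataFrame([[0.5, 0.3, 0.2],
                                   [0.6, 0.4, 0.0]]))

# The lambda_ = 0 term of the relevance formula multiplies -inf by zero,
# which is NaN under IEEE 754 arithmetic.
with np.errstate(invalid="ignore"):
    relevance = 0.0 * log_ttd

# Emulate the NaN-skipping selection: topic 1 has only two candidates left,
# so the two topics yield top-term indexes of different lengths.
top_per_topic = [relevance.loc[t].dropna().nlargest(R).index
                 for t in relevance.index]
lengths = [len(ix) for ix in top_per_topic]
print(relevance)
print("top-term counts per topic:", lengths)
```

Because the per-topic results no longer have a common length, they cannot be laid out as a rectangular table of term ids, which matches the Int64Index cells shown in the top_terms output above.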

Finally, when topic_terms.unique() is called on line 264 for each row of top_terms, the elements are Int64Index objects rather than term ids, and it gives the error

  File "/usr/local/lib/python3.9/site-packages/pyLDAvis/gensim_models.py", line 123, in prepare
    return pyLDAvis.prepare(**opts)
  File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 439, in prepare
    topic_info = _topic_info(topic_term_dists, topic_proportion,
  File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 280, in _topic_info
    return pd.concat([default_term_info] + list(topic_dfs))
  File "/usr/local/lib/python3.9/site-packages/pyLDAvis/_prepare.py", line 264, in topic_top_term_df
    term_ix = topic_terms.unique()
  File "/usr/local/lib/python3.9/site-packages/pandas/core/series.py", line 1872, in unique
    result = super().unique()
  File "/usr/local/lib/python3.9/site-packages/pandas/core/base.py", line 1047, in unique
    result = unique1d(values)
  File "/usr/local/lib/python3.9/site-packages/pandas/core/algorithms.py", line 407, in unique
    uniques = table.unique(values)
  File "pandas/_libs/hashtable_class_helper.pxi", line 4719, in pandas._libs.hashtable.PyObjectHashTable.unique
  File "pandas/_libs/hashtable_class_helper.pxi", line 4666, in pandas._libs.hashtable.PyObjectHashTable._unique
  File "/usr/local/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 4273, in __hash__
    raise TypeError(f"unhashable type: {repr(type(self).__name__)}")
TypeError: unhashable type: 'Int64Index'
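The final failure can be reproduced without pyLDAvis. The following is a small sketch, assuming only pandas and NumPy, of a Series whose cells are Index objects; the class name in the message depends on the pandas version (newer releases report 'Index' rather than 'Int64Index').

```python
import numpy as np
import pandas as pd

# Build an object-dtype Series whose elements are Index objects,
# similar in shape to one row of the broken top_terms frame.
cells = np.empty(2, dtype=object)
cells[0] = pd.Index([2, 7, 11])
cells[1] = pd.Index([1])
topic_terms = pd.Series(cells)

# unique() hashes each element, and pandas Index objects are unhashable.
try:
    topic_terms.unique()
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```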

Given that this problem is not specific to LDA Mallet (any model that has a 0 in topic_term_dists will have the same problem), I think the easiest fix is to replace the 0 entries in topic_term_dists with a value close to, but not exactly, 0, such as 1e-10, to avoid getting -inf in the first place.

I suggest modifying lines 258-259 in _prepare.py to

# to avoid -inf when calculating log_lift and log_ttd
topic_term_dists_non_zero = topic_term_dists.replace(0,1e-10)
log_lift = np.log(pd.eval("topic_term_dists_non_zero / term_proportion")).astype("float64")
log_ttd = np.log(pd.eval("topic_term_dists_non_zero")).astype("float64")
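To see the effect of the suggested replacement, here is a hedged sketch on the same kind of toy data. The matrix and the term_proportion values are made up for illustration, and plain pandas arithmetic is used instead of pd.eval to keep the example self-contained.

```python
import numpy as np
import pandas as pd

EPS = 1e-10  # floor value taken from the suggestion above

# Toy inputs (made up): 2 topics x 3 terms, plus overall term proportions.
topic_term_dists = pd.DataFrame([[0.5, 0.3, 0.2],
                                 [0.6, 0.4, 0.0]])
term_proportion = pd.Series([0.55, 0.35, 0.10])

# Replace exact zeros before taking logs, so that no -inf can appear.
topic_term_dists_non_zero = topic_term_dists.replace(0, EPS)
log_ttd = np.log(topic_term_dists_non_zero)
log_lift = np.log(topic_term_dists_non_zero / term_proportion)

all_finite = bool(np.isfinite(log_ttd.to_numpy()).all()
                  and np.isfinite(log_lift.to_numpy()).all())
print("all log values finite:", all_finite)
```

One side effect worth keeping in mind is that rows containing replaced zeros no longer sum to exactly 1; the excess is on the order of 1e-10 per replaced entry.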