gregversteeg/corex_topic

Index error when using anchors

owlas opened this issue · 7 comments

owlas commented

When using the fit method with anchors I get an index error from this line:

p_y_given_x[:, j] = 0.5 * p_y_given_x[:, j] + 0.5 * X[:, a].mean(axis=1).A1 # Assumes X is a binary matrix

The error is understandable because if X is a 2d array, then X[:,i] is a 1d slice and thereforeX[:,i].mean(axis=1) is undefined because there is no dimension 1.

I've installed version corextopic==1.0.5 from pypi.

I can reproduce this for any arguments passed to anchors

owlas commented

Is a expected to be a list? I'm wondering if there is something missing in the preprocessing step for turning anchors of words into indices.

Thanks for pointing this out. I think anchors has to be a list of lists: like anchors = [['cat', 'dog'], ['apple']].
Then "a" will be a list, and the slice will always produce an object that has axis=1!
However! I note that the examples on the readme do not always match this intuition. I'll ask @ryanjgallagher to check if there is an issue with the readme.

topic_model.fit(X, words=words, anchors=[['dog','cat'], 'apple'], anchor_strength=2)
Should be [['dog','cat'], ['apple']]?

topic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2)
Should be [['protest'], ['protest']... ?

topic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado'], anchor_strength=2)
This one is missing a bracket I think.

The anchor input for the topic model should be a list. Within that list, the entries can be either strings, ints, or lists, and you should be able to do any combination of those for anchoring. Individual strings or ints (indicating you want to anchor only one word to a topic) are converted to lists with a single entry, and strings are converted to their corresponding column index.

So @gregversteeg isn't quite right, even if you pass just a string or int in the anchors list, it should be preprocessed properly. (He's right that the last example is missing a bracket though, my bad).

@owlas Can you provide a simple reproducible example? I can run the examples in the README without any issues using version 1.0.5, so I'm not sure what error you may be getting.

Can report the same for 1.0.6. Although I cannot reproduce the error consistently. For some sets of anchors it works, for some it throws an error (all of those should be valid according to the docs though)

@d-lowl Would you be able to provide a minimal reproducible example that does fail?

Yeah, I made it work for my case (it was partially a problem in my code), but I still think that there are some edge cases. I will play around with it tomorrow and try to come up with one.

Hi! I am not sure if this repo is still maintained but I ran into the same issue.

I found that single item anchor lists are transferred (back) into single items here:
https://github.com/gregversteeg/corex_topic/blob/beea64bc41e62dffc5fb87deb506a3e253be0a6c/corextopic/corextopic.py#L367C27-L367C27

This would contradict @ryanjgallagher comment:

Individual strings or ints (indicating you want to anchor only one word to a topic) are converted to lists with a single entry

Removing the edge case for single item lists solved the problem for me:

if len(new_anchor_list) == 0:
    continue
# if len(new_anchor_list) == 1:
#    processed_anchors.append(new_anchor_list[0])
else:
    processed_anchors.append(new_anchor_list)