gregversteeg/corex_topic

CorEx pickle.load() breaks with Unicode words

ryanjgallagher opened this issue · 3 comments

If 'words' is passed to the CorEx object upon training, and 'words' contains Unicode characters, then you cannot load the CorEx object using pickle.load() after saving it using pickle.dump().

If you plan to save the topic model object and use it later, you can either leave 'words' out of the model and build the topics yourself from the indices, or save only the parts of the CorEx topic model that you need later. Both workarounds are a hassle.
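For anyone hitting this, here is a rough sketch of the first workaround: pickle the model without its words and keep the words in a separate JSON file. `StandInTopicModel`, `dump_without_words`, and `load_with_words` are hypothetical names made up for illustration. The stand-in class is not the CorEx class, and the only assumption carried over is that the model keeps its vocabulary in a `words` attribute.

```python
import copy
import io
import json
import pickle


class StandInTopicModel:
    """Hypothetical stand-in for a trained topic model; NOT the CorEx class."""

    def __init__(self, words, weights):
        self.words = words      # vocabulary, may contain non-ASCII strings
        self.weights = weights  # placeholder for learned parameters


def dump_without_words(model, model_file, words_file):
    """Pickle a copy of the model with `words` cleared; write the words as JSON."""
    stripped = copy.copy(model)  # shallow copy, so the caller's model keeps its words
    stripped.words = None
    pickle.dump(stripped, model_file)
    words_file.write(json.dumps(list(model.words), ensure_ascii=False))


def load_with_words(model_file, words_file):
    """Unpickle the model and re-attach the words read from the JSON file."""
    model = pickle.load(model_file)
    model.words = json.loads(words_file.read())
    return model


# In-memory buffers stand in for files opened in "wb"/"rb" and text mode.
model_buf, words_buf = io.BytesIO(), io.StringIO()
original = StandInTopicModel(["café", "naïve", "日本語"], [0.1, 0.2, 0.3])
dump_without_words(original, model_buf, words_buf)

model_buf.seek(0)
words_buf.seek(0)
restored = load_with_words(model_buf, words_buf)
```

With real files, the words file would need to be opened with an explicit UTF-8 encoding so that the non-ASCII words survive the round trip.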

Not sure how to fix this.

Here's an example:
ValueError: ('initialization string is too small', <built-in function scalar>, (dtype('<U2'), '\x00\xd0\x12\x00'))
This is from a topic model that I saved with the file opened in "wb" mode. I had a hard time figuring out what this error means from a quick Google search.

I just used "\x00\xd0\x12\x00" in the example that's in the documentation on the main page, and saving and then loading the CorEx object works fine. I'm not sure what could be going on, but I think I'll update the CorEx object so that the built-in save() function doesn't save "words".
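A minimal sketch of what a save() that skips "words" could look like. This is only an illustration on a made-up class (`SketchModel` and its `n_hidden` attribute are hypothetical), not the actual CorEx implementation or its save() signature.

```python
import io
import pickle


class SketchModel:
    """Hypothetical class for illustration only; NOT the CorEx implementation."""

    def __init__(self, words=None):
        self.words = words
        self.n_hidden = 2  # placeholder for fitted state

    def save(self, fileobj):
        """Pickle a copy of this instance with `words` cleared."""
        state = dict(self.__dict__)
        state["words"] = None
        # Build the copy without calling __init__, then fill in the stripped state.
        clone = self.__class__.__new__(self.__class__)
        clone.__dict__.update(state)
        pickle.dump(clone, fileobj)


buf = io.BytesIO()
model = SketchModel(words=["café", "日本語"])
model.save(buf)

buf.seek(0)
loaded = pickle.load(buf)  # loaded.words is None; model.words is unchanged
```

The in-memory model keeps its words, so topics can still be printed in the same session. Only the pickled copy drops them.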