CorEx pickle.load() breaks with Unicode words

Question


ryanjgallagher opened this issue 7 years ago · 3 comments

If 'words' is passed to the CorEx object upon training, and 'words' contains Unicode characters, then you cannot load the CorEx object using pickle.load() after saving it using pickle.dump().

If you are planning on saving and later using the topic model object, then you can either not load words into it and just make the topics yourself using the indices, or you can save the parts of the CorEx topic model that you want for later. Both of these workarounds are a hassle.
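The first workaround (keeping the words outside the model and building the topics from indices) might look roughly like the sketch below. The word list and `topic_indices` are made-up illustrative data, not CorEx output, and storing the words as UTF-8 JSON is just one possible choice.

```python
import json
import os
import tempfile

# Illustrative vocabulary containing non-ASCII words (made-up data).
words = ["café", "naïve", "señor", "data", "model"]

# Hypothetical per-topic column indices, as you might collect them from a
# model that was trained without a word list.
topic_indices = [[0, 2], [3, 4, 1]]

words_path = os.path.join(tempfile.mkdtemp(), "words.json")

# Store the vocabulary separately as UTF-8 text instead of inside the pickle.
with open(words_path, "w", encoding="utf-8") as f:
    json.dump(words, f, ensure_ascii=False)

# Later: reload the vocabulary and map the indices back to words.
with open(words_path, encoding="utf-8") as f:
    loaded_words = json.load(f)

topics = [[loaded_words[i] for i in indices] for indices in topic_indices]
print(topics)
```

This keeps the pickled model free of any word data, at the cost of having to track a second file.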

Not sure how to fix this.

Answer 1 · 2017-07-18T22:17:41.000Z

What is the pickling error you get? I'm surprised this happens; it should be possible to pickle Unicode. Maybe the file mode is wrong? Is it "w" or "wb" when pickling? I think it should be the latter: open("file.dat", "wb").
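As a quick sanity check of the file-mode suggestion, here is a minimal sketch of a binary-mode round trip with Unicode strings. It uses a plain dict as a stand-in for the model (an assumption for illustration; it does not reproduce the CorEx object or the reported error).

```python
import os
import pickle
import tempfile

# Plain dict standing in for a model that carries a Unicode word list.
state = {"words": ["café", "naïve", "Ґ"]}

path = os.path.join(tempfile.mkdtemp(), "file.dat")

# Pickle data is bytes, so the file is opened in binary mode for writing...
with open(path, "wb") as f:
    pickle.dump(state, f)

# ...and in binary mode again for reading.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["words"])
```

If a round trip like this works but the real object still fails to load, the file mode is probably not the cause.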


Answer 2 · 2017-07-18T22:24:23.000Z

Here's an example:
ValueError: ('initialization string is too small', <built-in function scalar>, (dtype('<U2'), '\x00\xd0\x12\x00'))
This is from a topic model that I saved using "wb". A quick Google search didn't make it clear what this error means.

Answer 3 · 2017-07-27T18:55:47.000Z

I just used "\x00\xd0\x12\x00" in the documentation example on the main page, and saving and loading the CorEx object worked fine. I'm not sure what could be going on, but I think I'll update the CorEx object so that its built-in save() function doesn't save "words".
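A rough sketch of what a save() that leaves out "words" could look like is below. `ToyTopicModel` is a hypothetical stand-in class, not the actual CorEx implementation, and the attribute names are assumptions made for illustration.

```python
import os
import pickle
import tempfile


class ToyTopicModel:
    """Hypothetical stand-in for a topic model; not the CorEx class."""

    def __init__(self, n_hidden, words=None):
        self.n_hidden = n_hidden
        self.words = words

    def save(self, path):
        # Copy the attribute dict and blank out the word list before pickling.
        state = dict(self.__dict__)
        state["words"] = None
        with open(path, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        # Rebuild the object without calling __init__, then restore its state.
        model = cls.__new__(cls)
        model.__dict__.update(state)
        return model


model = ToyTopicModel(n_hidden=2, words=["café", "naïve"])
path = os.path.join(tempfile.mkdtemp(), "model.dat")
model.save(path)

restored = ToyTopicModel.load(path)
print(restored.n_hidden, restored.words)
```

With this approach the in-memory model keeps its words, but anyone reloading it has to supply the word list again from somewhere else.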