piskvorky/gensim

Merging corpora requires converting itertools chain object to list object

mspezio opened this issue · 2 comments

When merging corpora, the itertools.chain object must be converted to a list. Otherwise the serialized output does not contain the older (first) corpus.

    # now we can merge corpora from the two incompatible dictionaries into one
    merged_corpus = itertools.chain(some_corpus_from_dict1, dict2_to_dict1[some_corpus_from_dict2])

should be

    merged_corpus = list(itertools.chain(some_corpus_from_dict1, dict2_to_dict1[some_corpus_from_dict2]))

Then merged_corpus can be serialized with the standard call:

    MmCorpus.serialize(merged_corpus_output_fname, merged_corpus)

Definitely not. Do not convert your corpora to lists.

My results were obtained with Python 3.11 running in Visual Studio Code 1.78.0.

Gensim 4.3.1. Installed via pip.

I've attached the corpus (from doc2bow) and dictionary files. The code that replicates the issue is:

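    import itertools

    from gensim import corpora

    # C1file, D1file, C2file, D2file and merged_corpus_file are paths to the attached files
    C1 = corpora.MmCorpus(C1file)
    D1 = corpora.Dictionary.load(D1file)
    C2 = corpora.MmCorpus(C2file)
    D2 = corpora.Dictionary.load(D2file)

    # merge_with returns a transformation that maps D2 token ids to D1 ids
    merged_dict = D1.merge_with(D2)
    merged_corpus = itertools.chain(C1, merged_dict[C2])
    # results in -1 values for documents in the first corpus
    corpora.MmCorpus.serialize(merged_corpus_file, merged_corpus)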

The serialize operation of MmCorpus produces a merged corpus with -1 values for all documents except those in the second corpus. Converting the itertools.chain object to a list restored the expected behaviour of a fully merged corpus.

If this works as expected without converting the itertools.chain object to a list then please send working code. Thank you.
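For reference, here is a minimal sketch of one way to keep the merged corpus streamed instead of materialising it as a list. An itertools.chain object is a one-shot iterator, so it yields nothing on a second pass. ChainedCorpus below is a hypothetical helper, not part of gensim, and plain Python lists stand in for the MmCorpus objects. Whether this avoids the -1 values described above has not been verified here.

```python
import itertools


class ChainedCorpus:
    """Hypothetical helper (not part of gensim): a chain that can be iterated repeatedly."""

    def __init__(self, *corpora):
        self.corpora = corpora

    def __iter__(self):
        # Build a fresh chain on every pass; documents are never collected in memory.
        return itertools.chain.from_iterable(self.corpora)


# Plain lists of (token_id, count) pairs stand in for streamed corpora.
corpus_a = [[(0, 1), (1, 2)], [(2, 1)]]
corpus_b = [[(3, 1)]]

# A bare chain can only be consumed once.
one_shot = itertools.chain(corpus_a, corpus_b)
first_pass = list(one_shot)
second_pass = list(one_shot)  # the chain is already exhausted here

# The wrapper yields the same documents on every pass.
reusable = ChainedCorpus(corpus_a, corpus_b)
reusable_first = list(reusable)
reusable_second = list(reusable)
```

With real data, the assumption is that ChainedCorpus(C1, merged_dict[C2]) could be passed to MmCorpus.serialize in place of the bare chain, provided both inputs can themselves be iterated more than once.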

MergeTry_Files.zip

Version: 1.78.0 (Universal)
Commit: 252e5463d60e63238250799aef7375787f68b4ee
Date: 2023-05-03T20:11:00.813Z
Electron: 22.4.8
Chromium: 108.0.5359.215
Node.js: 16.17.1
V8: 10.8.168.25-electron.0
OS: Darwin x64 19.6.0
Sandboxed: No