Merging corpora requires converting itertools chain object to list object
mspezio opened this issue · 2 comments
When merging corpora, it is essential to convert the itertools.chain object to a list. Otherwise, the serialization does not save the documents from the older (first) corpus.
# now we can merge corpora from the two incompatible dictionaries into one
merged_corpus = itertools.chain(some_corpus_from_dict1, dict2_to_dict1[some_corpus_from_dict2])
should be
merged_corpus = list(itertools.chain(some_corpus_from_dict1, dict2_to_dict1[some_corpus_from_dict2]))
Then the merged_corpus can be serialized using the standard
MmCorpus.serialize(merged_corpus_output_fname, merged_corpus)
Definitely not. Do not convert your corpora to lists.
My results were obtained with Python 3.11 running in Visual Studio Code 1.78.0.
Gensim 4.3.1. Installed via pip.
I've attached the corpus (from doc2bow) and dictionary files. The code that replicates the issue is this:
import itertools
from gensim import corpora

# C1file, D1file, C2file, D2file and merged_corpus_file are paths to the attached files
C1 = corpora.MmCorpus(C1file)
D1 = corpora.Dictionary.load(D1file)
C2 = corpora.MmCorpus(C2file)
D2 = corpora.Dictionary.load(D2file)
merged_dict = D1.merge_with(D2)  # returns a transformer that maps D2 ids to D1 ids
merged_corpus = itertools.chain(C1, merged_dict[C2])
corpora.MmCorpus.serialize(merged_corpus_file, merged_corpus)  # results in -1 values for documents in the first corpus
The serialize operation of MmCorpus produces a merged corpus with -1 values for all documents except those in the second corpus. Converting the itertools.chain object to a list restored the expected behavior and produced the fully merged corpus.
If this works as expected without converting the itertools.chain object to a list then please send working code. Thank you.
Version: 1.78.0 (Universal)
MergeTry_Files.zip
Commit: 252e5463d60e63238250799aef7375787f68b4ee
Date: 2023-05-03T20:11:00.813Z
Electron: 22.4.8
Chromium: 108.0.5359.215
Node.js: 16.17.1
V8: 10.8.168.25-electron.0
OS: Darwin x64 19.6.0
Sandboxed: No
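A possible middle ground between the two positions above, offered only as a sketch: an itertools.chain object is a single-pass iterator, so any code that walks it a second time sees nothing, whereas a list can be walked repeatedly but holds every document in memory. The small wrapper below (ReiterableChain is a hypothetical name, not part of gensim) streams the documents and starts a fresh pass on every iteration. Whether a second pass is actually what causes the -1 values in this report is an assumption that has not been verified here; the demo uses plain Python lists in place of real corpora.

```python
import itertools


class ReiterableChain:
    """Hypothetical helper: chain several re-iterable corpora lazily.

    Unlike a bare itertools.chain object, every call to __iter__ starts
    a fresh pass over the underlying corpora, and nothing is copied
    into memory.
    """

    def __init__(self, *corpora):
        self.corpora = corpora

    def __iter__(self):
        # chain.from_iterable calls iter() on each corpus lazily,
        # so each new pass restarts from the first document.
        return itertools.chain.from_iterable(self.corpora)


# Toy bag-of-words corpora standing in for C1 and merged_dict[C2].
corpus_a = [[(0, 1), (1, 2)], [(1, 1)]]
corpus_b = [[(2, 3)]]

# A bare chain is exhausted after one pass.
one_shot = itertools.chain(corpus_a, corpus_b)
first_pass = list(one_shot)
second_pass = list(one_shot)

# The wrapper can be iterated as many times as needed.
merged = ReiterableChain(corpus_a, corpus_b)
pass_one = list(merged)
pass_two = list(merged)
```

With the code from this issue, the equivalent call would be corpora.MmCorpus.serialize(merged_corpus_file, ReiterableChain(C1, merged_dict[C2])). That combination has not been run against the attached files, so treat it as something to try rather than a confirmed fix.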