4teamwork/docxcompose

Memory problems


Hi all,
When I merge 200 files, memory consumption is too high. Is there any way to solve this?

@Link-Go Would you mind sharing your solution if you eventually solved this issue? Thanks in advance.

@abubelinha (self-mention to find this when searching issues)

@abubelinha and anyone else who is facing the same issue:
tl;dr - It seems that a reference is being held that prevents the docx being appended/inserted from being garbage collected after the append/insert. The workaround that we've employed:

import gc
import docx
from docxcompose.composer import Composer
...

def merge(composer, doc_path):
    ...
    doc_to_merge = docx.Document(doc_path)
    composer.append(doc_to_merge)
    # XXX at this point doc_to_merge will not be gc()'d automatically
    # when it goes out of scope
    ...

def merge_all(document_paths):
    ...
    merged_doc = docx.Document()
    composer = Composer(merged_doc)
    for document_path in document_paths:
        merge(composer, document_path)
        gc.collect()  # this ensures doc_to_merge is gc()'d

I did some memory profiling (using memory-profiler) and noticed that although the append/insert methods do not directly add to the memory footprint, the docx object being appended/inserted does not get garbage collected after the call returns. It looks as though a circular reference is being maintained somewhere, which prevents the object from being freed as soon as it goes out of scope.
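To illustrate the suspected mechanism, here is a minimal sketch using only the standard library. `Node` is a hypothetical stand-in, not a python-docx or docxcompose class, and the behaviour shown is specific to CPython's reference counting. It only demonstrates that an object caught in a reference cycle survives going out of scope until the cyclic collector runs; it does not prove that this is what happens inside docxcompose.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a document object."""

    def __init__(self):
        self.parent = None
        self.children = []


def make_plain():
    # No cycle: the Node is freed by reference counting right away.
    return weakref.ref(Node())


def make_cycle():
    # parent -> children list -> child -> parent forms a reference cycle.
    parent = Node()
    child = Node()
    child.parent = parent
    parent.children.append(child)
    return weakref.ref(parent)


was_enabled = gc.isenabled()
gc.disable()  # keep the automatic collector from interfering with the demo
try:
    plain_ref = make_plain()
    cycle_ref = make_cycle()
    plain_alive_before = plain_ref() is not None
    cycle_alive_before = cycle_ref() is not None  # still alive despite being out of scope
    gc.collect()  # explicit pass of the cyclic collector
    cycle_alive_after = cycle_ref() is not None
finally:
    if was_enabled:
        gc.enable()

print(plain_alive_before, cycle_alive_before, cycle_alive_after)
```

The weak references let the script observe whether each object is still alive without keeping it alive itself.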

Our workaround is to invoke gc.collect() soon after calling .append(). This fixes the problem for now. I might dig a bit deeper to see whether I can fix the underlying issue of the references being held, and I will update this ticket if I manage to isolate it.

It would be helpful to know if I'm on the right track here and the workaround works for others as well.

Sorry, but I am new to this package and have not actually tried to implement anything yet.
I was just wondering how to do it, in case @Link-Go replied with an answer.

But I do not fully understand your example, as your functions don't return anything.
Could you share a full script I can just run, so I can tell you if it works for me as well?
Thanks!
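A fuller version of the workaround, with return values and a save step, might look like the sketch below. It is an assumption-laden illustration, not the original poster's actual script: `append_all` and `merge_docx_files` are hypothetical names, and the loop is written against plain callables so that it can be tried without any documents at hand. The only library calls used are `docx.Document`, `Composer(...)`, `Composer.append` and `Composer.save`.

```python
import gc
from typing import Callable, Iterable, TypeVar

D = TypeVar("D")


def append_all(
    paths: Iterable[str],
    load: Callable[[str], D],
    append: Callable[[D], object],
) -> int:
    """Load each path, append it, then force a collection. Returns the count."""
    count = 0
    for path in paths:
        doc = load(path)
        append(doc)
        del doc        # drop our own reference before collecting
        gc.collect()   # reclaim anything kept alive only by reference cycles
        count += 1
    return count


def merge_docx_files(document_paths, output_path):
    """Merge the given .docx files into output_path; returns how many were appended."""
    # Imported here so append_all can be tried without the packages installed.
    import docx
    from docxcompose.composer import Composer

    composer = Composer(docx.Document())
    count = append_all(document_paths, docx.Document, composer.append)
    composer.save(output_path)
    return count
```

With python-docx and docxcompose installed, a call such as `merge_docx_files(["a.docx", "b.docx"], "merged.docx")` (file names are placeholders) would run the whole merge.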

I checked memory consumption with and without garbage collection and did not see much difference.
[Screenshot 2022-07-04 at 17 52 10]
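One way to make such a comparison reproducible is a small synthetic benchmark like the sketch below. `FakeDoc` and `peak_traced_bytes` are hypothetical names, and the automatic collector is switched off during the measurement so that the two runs differ only in the explicit `gc.collect()` call. In a real run the collector also triggers on its own, and `tracemalloc` only sees memory allocated through Python's allocators, so memory held by C libraries (python-docx builds on lxml) would likely not show up here. Either effect could narrow the difference seen in practice.

```python
import gc
import tracemalloc


class FakeDoc:
    """Hypothetical stand-in for a loaded document that ends up in a cycle."""

    def __init__(self, size):
        self.payload = bytearray(size)  # simulates the document's data
        self.me = self                  # self-reference: only the cyclic GC can free it


def peak_traced_bytes(n_docs, doc_size, collect_each_time):
    """Return the peak Python-level allocation while creating n_docs FakeDocs."""
    was_enabled = gc.isenabled()
    gc.collect()
    gc.disable()  # make the comparison deterministic
    tracemalloc.start()
    try:
        for _ in range(n_docs):
            FakeDoc(doc_size)
            if collect_each_time:
                gc.collect()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
        gc.collect()  # clean up whatever the loop left behind
        if was_enabled:
            gc.enable()
    return peak


if __name__ == "__main__":
    for flag in (False, True):
        peak = peak_traced_bytes(10, 1_000_000, collect_each_time=flag)
        print(f"collect_each_time={flag}: peak {peak / 1e6:.1f} MB")
```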