python-openxml/python-docx

Content of merged cells

Closed this issue · 4 comments

I am merging cells with the same content where I want the following table

[Screenshot: the original table, before merging]

to merge to

[Screenshot: the desired table, after merging]

Based on this (from the documentation):

When two or more cells are merged, any existing content is concatenated and placed in the resulting merged cell. Content from each original cell is separated from that in the prior original cell by a paragraph mark. An original cell having no content is skipped in the concatenation process.
Merging four cells with content 'a', 'b', '', and 'd' respectively results in a merged cell having text 'a\nb\nd'.

I thought the following code would produce a merged cell with only one paragraph. However, the merged cell appears to contain two paragraphs.

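from docx import Document

doc = Document('fuits.docx')
tab = doc.tables[0]
# -- the two cells to merge: rows 1 and 2 of the first column --
c1 = tab.cell(1, 0)
c2 = tab.cell(2, 0)

# -- blank the lower cell, hoping it is then skipped by the merge --
c2.text = ''
cm = c1.merge(c2)
cm.paragraphs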

Result:

[<docx.text.paragraph.Paragraph at 0x112cb9330>,
 <docx.text.paragraph.Paragraph at 0x112cba6b0>]

Is this expected?

scanny commented

Sounds plausible. What is the specific before and after text of the cells? The "\n" in the docs would be in cell.text. If you're looking in cell.paragraphs each "\n" would start a new paragraph.
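To make the relationship between the documented "a\nb\nd" and `cell.paragraphs` concrete, here is a rough model in plain Python. It does not call python-docx; `merged_paragraphs` is a hypothetical name, and the function only mimics the rule quoted from the docs (skip cells with no content, one paragraph per remaining cell).

```python
def merged_paragraphs(cell_texts):
    """Model of the documented rule: cells with no content are skipped."""
    return [text for text in cell_texts if text]


paragraphs = merged_paragraphs(["a", "b", "", "d"])

# The text of a cell is its paragraph texts joined by "\n",
# so every "\n" you see in the text marks a paragraph boundary.
text = "\n".join(paragraphs)

print(repr(text))
print(len(paragraphs))
```

So a merged cell whose text contains one "\n" would be expected to show two entries in `paragraphs`.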

scanny commented

If you merged the top version you would receive the bottom version but with "Apple\nApple". The merging algorithm doesn't have anything to do with deduplicating text. You'll have to take care of that yourself.
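One way to take care of it yourself is to collapse repeated lines in the merged text. A minimal sketch in plain Python, where `dedupe_lines` is a hypothetical helper (not part of python-docx) and the input is assumed to be the merged cell's `text` value:

```python
def dedupe_lines(text: str) -> str:
    """Drop repeated lines, keeping the first occurrence of each in order."""
    # dict.fromkeys preserves insertion order, so the first copy of each line wins
    return "\n".join(dict.fromkeys(text.split("\n")))


merged_text = "Apple\nApple"  # the text described above for the merged cell
print(repr(dedupe_lines(merged_text)))
```

Assigning the result back with something like `cm.text = dedupe_lines(cm.text)` should then leave a single "Apple", but note that setting `text` replaces the cell content wholesale, so any run-level formatting in the cell would be lost.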

Thanks for taking a look at this, @scanny. The documentation says "An original cell having no content is skipped in the concatenation process", so I thought that setting c2.text = '' would cause the whole paragraph to be skipped. Thanks for clearing that up!

scanny commented

This is the code that controls that behavior:
https://github.com/python-openxml/python-docx/blob/master/src/docx/oxml/table.py#L616-L630

I expect what's happening is that a paragraph with a run that contains the empty string is considered distinct from a paragraph with no runs, the latter being what you get if the cell was empty from the start.
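The distinction can be illustrated with the standard library alone. The sketch below is a simplified model of that kind of emptiness check, not the library's actual code: `looks_empty` is a hypothetical name, and the two XML snippets are assumed shapes for a never-filled cell and a cell whose text was set to "".

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = f'xmlns:w="{W}"'


def looks_empty(tc):
    """Treat a cell as empty only if it holds a single paragraph with no runs."""
    # ignore cell properties; consider only block-level content
    blocks = [child for child in tc if child.tag != f"{{{W}}}tcPr"]
    if len(blocks) != 1:
        return False
    only = blocks[0]
    return only.tag == f"{{{W}}}p" and only.find(f"{{{W}}}r") is None


# assumed shape of a cell that never had content: one bare paragraph
never_filled = ET.fromstring(f"<w:tc {NS}><w:p/></w:tc>")
# assumed shape after setting the text to "": the paragraph now has a run
set_to_empty = ET.fromstring(
    f"<w:tc {NS}><w:p><w:r><w:t></w:t></w:r></w:p></w:tc>"
)

print(looks_empty(never_filled))
print(looks_empty(set_to_empty))
```

Under this model the second cell is not skipped during the merge, which would explain the extra paragraph.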

If you wanted to hack something in then this might produce the result you're looking for:

# -- instead of `cell.text = ""`... --
tc = cell._tc
tc.clear_content()
p = tc.add_p()

This is what the Cell.text setter is doing but it goes a little further:
https://github.com/python-openxml/python-docx/blob/master/src/docx/table.py#L273-L284