ArtifexSoftware/pdf2docx

Negative ref_dif in Blocks.py causing paragraph splitting

tehwenyi opened this issue · 0 comments

I've encountered an issue with paragraph splitting in some documents, where certain pages separate sentences in the same paragraph into different text blocks while others do not. Upon investigation, I found that the problem originates from the _join_lines_vertically function (line 423) in Blocks.py, particularly with the ref_dif value generated by the common_vertical_spacing function (line 444).

The issue arises when ref_dif becomes negative (e.g., -15.3, -8.2) on certain pages. This sets the start_new_block flag to True when it should be False, so a new block is incorrectly started at every sentence. Is this behavior intended?

To address this issue, I modified line 452 in common_vertical_spacing of Blocks.py so that ref_dif is never negative:

return max(max(distances, key=distances.count), 0.0) if distances else 0.0
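To illustrate the effect of the clamp in isolation, here is a minimal standalone sketch. It is not the actual pdf2docx code: the function names and sample gaps are hypothetical, and I'm assuming the unmodified return is the same expression without the outer max(..., 0.0).

```python
# Standalone sketch; names and sample values are hypothetical, not from pdf2docx.

def most_common_spacing(distances):
    """Assumed unmodified behaviour: the most frequent distance, or 0.0 if empty."""
    return max(distances, key=distances.count) if distances else 0.0


def most_common_spacing_clamped(distances):
    """Proposed behaviour: same as above, but never below 0.0."""
    return max(max(distances, key=distances.count), 0.0) if distances else 0.0


# Hypothetical rounded vertical gaps where the most frequent value is negative.
gaps = [-15.3, -15.3, 2.1]

print(most_common_spacing(gaps))          # most frequent value, negative here
print(most_common_spacing_clamped(gaps))  # clamped to 0.0
```

Note that the clamp only changes the result when the most frequent distance is negative; non-negative values pass through unchanged.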

If this behaviour is indeed unexpected and the proposed fix resolves the issue for you, I'd be happy to open a pull request.

Thanks!