Negative ref_dif in Blocks.py causing paragraph splitting
tehwenyi opened this issue · 0 comments
I've encountered an issue with paragraph splitting in some documents, where certain pages separate sentences in the same paragraph into different text blocks while others do not. Upon investigation, I found that the problem originates from the _join_lines_vertically function (line 423) in Blocks.py, particularly with the ref_dif value generated by the common_vertical_spacing function (line 444).
The issue arises when ref_dif becomes negative (e.g., -15.3, -8.2) on certain pages, causing the start_new_block flag to be True when it should be False, thus incorrectly initiating a new block at every sentence. Is this behavior intended?
To address this issue, I modified line 452 in common_vertical_spacing of Blocks.py to ensure that ref_dif is never negative:
return max(max(distances, key=distances.count), 0.0) if distances else 0.0
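To show the effect of the clamp in isolation, here is a minimal standalone sketch. It is not the library's code: the function names and the sample distances are hypothetical, and the unclamped version is only my reading of what the return line looked like before the change.

```python
def most_common_spacing_unclamped(distances):
    # Hypothetical stand-in for the return line before the change:
    # the most frequent distance, or 0.0 when the list is empty.
    return max(distances, key=distances.count) if distances else 0.0


def most_common_spacing_clamped(distances):
    # Same selection, with the result clamped at 0.0 as in the proposed fix.
    return max(max(distances, key=distances.count), 0.0) if distances else 0.0


# Made-up sample in which a negative distance is the most frequent value.
sample = [-15.3, -15.3, 2.0, 12.5]

print(most_common_spacing_unclamped(sample))  # -15.3
print(most_common_spacing_clamped(sample))    # 0.0
```

With the clamp, a page whose most frequent distance is negative yields 0.0 instead of a negative ref_dif. Pages whose most frequent distance is already positive are unaffected.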
If this behavior is indeed unexpected and the proposed fix resolves the issue for you, I'd be happy to open a pull request.
Thanks!