How to extract PDF page text line by line?
I am trying to extract PDF text line by line.
I have tried
import fitz  # PyMuPDF
doc = fitz.open("UMNwriteup.pdf")
page = doc.load_page(0)
Option 1
page.get_text('text').split("\n")
but that results in some lines being broken up into chunks (because the spacing between words in one sentence is too large, a newline character is inserted).
Option 2
page.get_text('blocks')
That is closer to what I'm looking for, but some chunks (multi-line sentences) are intelligently grouped together into one block.
Option 3
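dictionary_elements = page.get_text('dict')
for block in dictionary_elements['blocks']:
    # text is accumulated per block, so each result spans all lines of a block
    line_text = ''
    # image blocks have no 'lines' key, so fall back to an empty list
    for line in block.get('lines', []):
        for span in line['spans']:
            line_text += ' ' + span['text']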
This results in output similar to option 2.
So how do I extract text line by line, without any chunking / blocks behind the scenes?
If I could stop newline characters from being inserted between two words that are separated by wide blank spaces (even though they are at the same bbox height), that would solve this for me.
Hi @JorjMcKie Thanks for any help.
This is no bug. But there is a way to get correct results. Please continue in the Discussions tab.
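For anyone landing here, one possible approach (a sketch, not necessarily the solution given in the Discussions tab) is to ignore the block / line grouping entirely and rebuild lines from individual words, joining words whose bottom coordinates are nearly equal. The helper name `group_words_into_lines` and the `tolerance` value are assumptions made for illustration. The input is assumed to be the word tuples that `page.get_text("words")` returns, where the first five items are `(x0, y0, x1, y1, text)`.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Rebuild text lines from word tuples by comparing bottom (y1) coordinates.

    Hypothetical helper: each word is assumed to start with
    (x0, y0, x1, y1, text); any further items are ignored.
    """
    lines = []  # each entry: [reference_y1, [(x0, text), ...]]
    # Visit words top to bottom, then left to right.
    for x0, y0, x1, y1, text, *rest in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(y1 - lines[-1][0]) <= tolerance:
            # Same visual line: bottom edge is within the tolerance.
            lines[-1][1].append((x0, text))
        else:
            # Bottom edge differs too much: start a new line.
            lines.append([y1, [(x0, text)]])
    # Within each line, order words by their left edge and join with spaces.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


# Synthetic word tuples standing in for page.get_text("words") output.
sample_words = [
    (200, 40, 240, 50.5, "world", 1, 0, 0),  # far to the right, same height
    (10, 40, 50, 50, "Hello", 0, 0, 0),
    (10, 60, 40, 70, "Second", 0, 1, 0),
    (45, 60, 80, 70, "line", 0, 1, 1),
]
result = group_words_into_lines(sample_words)
for text_line in result:
    print(text_line)
```

With a real document this would be called as `group_words_into_lines(page.get_text("words"))`. The tolerance may need tuning per document: too small and words with slightly different baselines are split, too large and adjacent lines are merged.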