How to extract pdf page text line by line?

Question

Closed this issue 23 days ago · 1 comments

I am trying to extract the text of a PDF page line by line.

I have tried the following:

import fitz  # PyMuPDF
doc = fitz.open("UMNwriteup.pdf")
page = doc.load_page(0)  # first page

Option 1
page.get_text('text').split("\n")

but that breaks some lines into chunks: when the spacing between words in one sentence is too large, a newline character is inserted.

Option 2
page.get_text('blocks')

That is closer to what I'm looking for, but some chunks (multi-line sentences) are automatically grouped together.

Option 3


dictionary_elements = page.get_text('dict')
for block in dictionary_elements['blocks']:
    line_text = ''
    for line in block['lines']:
        for span in line['spans']:
            line_text += ' ' + span['text']

This results in output similar to option 2.

So how do I extract text line by line, without any chunking / blocks behind the scenes?

If I could prevent newline characters from being inserted between two words that are separated by blank space (even though they are at the same bbox height), that would solve this for me.

Hi @JorjMcKie, thanks for any help.

Answer 1 · 2024-06-05T09:21:32.000Z

This is no bug. But there is a way to get correct results. Please continue in the Discussions tab.
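
For readers who land here, one possible approach (not necessarily the one meant above) is to ignore the block/line grouping entirely and rebuild lines from word positions. The sketch below assumes word tuples shaped like the output of page.get_text("words"), i.e. (x0, y0, x1, y1, text, block_no, line_no, word_no). The function name group_words_into_lines and the tolerance parameter are made up for illustration, and the default of 3 points is a guess that would need tuning per document.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Rebuild visual lines from word tuples by comparing bottom coordinates.

    Each item is assumed to look like (x0, y0, x1, y1, text, ...), the
    layout returned by PyMuPDF's page.get_text("words").
    `tolerance` (hypothetical parameter) is the largest difference in y1
    for two words to still count as being on the same line.
    """
    lines = []  # list of (reference_y1, [word tuples])
    # Visit words top to bottom, then left to right.
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(w[3] - lines[-1][0]) <= tolerance:
            # Bottom edge is close to the current line: same line.
            lines[-1][1].append(w)
        else:
            # Otherwise start a new line anchored at this word's bottom edge.
            lines.append((w[3], [w]))
    # Within each line, order words by their left edge and join with spaces.
    return [
        " ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
        for _, ws in lines
    ]
```

With PyMuPDF this would be called as group_words_into_lines(page.get_text("words")). Because only coordinates are compared, two words on the same baseline end up on the same output line no matter how wide the gap between them is. The trade-off is that text in side-by-side columns is merged into one line as well.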