yzane/vscode-markdown-pdf

Space after capital letter

gaehlerm opened this issue · 0 comments

I tried to search a PDF file created by Markdown PDF using pypdf, but it didn't work as expected because pypdf frequently found a space after a capital letter.
I think this is a problem with Markdown PDF, as I couldn't reproduce the error with PDF files from other sources, though I did not check extensively.

Here is the pdf file I created with Markdown PDF:
testfile.pdf

Here is the Python script that reproduces the bug (I used pypdf version 4.2.0):

import pypdf

PDF_FILE = "testfile.pdf"

def get_all_text():
    # Extract the text of every page and concatenate it.
    reader = pypdf.PdfReader(PDF_FILE)
    all_text = ""
    for page in reader.pages:
        all_text += page.extract_text()

    # Write the extracted text to a file for inspection.
    with open("all_text.txt", "w", encoding="utf-8") as file:
        file.write(all_text)

if __name__ == "__main__":
    get_all_text()

Here is the output (note the spaces after some of the capital letters). The output is reproducible on my machine.

testfile.md 2024-06-18
1 / 1A Lot Of Capitalized W ords Like S witzerland For Example. Where Is R obert?
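
To make the affected words easier to spot in a larger file, a rough check along these lines could be run on the extracted text. This is only a heuristic sketch: the function name `find_suspicious_splits` and the regular expression are my own illustration, not part of pypdf or Markdown PDF. It flags a single capital letter standing alone, followed by a space and a lowercase word, so it will also report legitimate phrases such as "A dog" or "I am".

```python
import re

# Hypothetical heuristic: a lone capital letter, one space, then a lowercase word.
SPLIT_PATTERN = re.compile(r"\b[A-Z] [a-z]+")

def find_suspicious_splits(text):
    """Return substrings that look like a word split after its capital letter."""
    return [match.group(0) for match in SPLIT_PATTERN.finditer(text)]

# Sample taken from the extracted output shown above.
SAMPLE = ("1 / 1A Lot Of Capitalized W ords Like S witzerland "
          "For Example. Where Is R obert?")

if __name__ == "__main__":
    for hit in find_suspicious_splits(SAMPLE):
        print(repr(hit))
```

Running the function on the contents of all_text.txt instead of `SAMPLE` gives a quick count of how often the split occurs in a given document.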