Space after capital letter
gaehlerm opened this issue · 0 comments
gaehlerm commented
I tried to search a PDF file created by Markdown PDF using pypdf, but it didn't work as expected: the text extracted by pypdf frequently contains a spurious space after a capital letter.
I think this is a problem in Markdown PDF, as I couldn't reproduce the error with PDF files from other sources, though I did not check extensively.
Here is the pdf file I created with Markdown PDF:
testfile.pdf
Here is the Python script that reproduces the bug (I used pypdf version 4.2.0):
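import pypdf

PDF_FILE = "testfile.pdf"


def get_all_text():
    all_text = ""
    # Despite its name, complete_text is the PdfReader object, not a string.
    complete_text = pypdf.PdfReader(PDF_FILE)
    # Concatenate the extracted text of every page.
    for page_obj in complete_text.pages:
        text = page_obj.extract_text()
        all_text += text
    # Write the result to a file so the spacing can be inspected.
    with open("all_text.txt", "w") as file:
        file.write(all_text)


if __name__ == "__main__":
    get_all_text()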
Here is the output (watch the spaces after the capital letters). The output seems to be reproducible.
testfile.md 2024-06-18
1 / 1A Lot Of Capitalized W ords Like S witzerland For Example. Where Is R obert?
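To check an extracted text for this pattern automatically, something like the following could be used. This is only a rough sketch using the standard library: the function name `find_split_capitals` and the regular expression are my own assumptions and are not part of pypdf or Markdown PDF. It flags a single capital letter followed by a space and a lowercase run, so it would also report legitimate one-letter words such as "A" or "I" when a lowercase word follows.

```python
import re

# Hypothetical helper: a lone capital letter at a word start, then a space,
# then lowercase letters (e.g. "W ords"). Not part of any library.
SPLIT_CAPITAL = re.compile(r"\b[A-Z] [a-z]+")


def find_split_capitals(text):
    """Return every fragment that looks like a word split after its capital."""
    return [match.group(0) for match in SPLIT_CAPITAL.finditer(text)]


if __name__ == "__main__":
    sample = (
        "1 / 1A Lot Of Capitalized W ords Like S witzerland "
        "For Example. Where Is R obert?"
    )
    for fragment in find_split_capitals(sample):
        print(fragment)
```

Running it on the contents of all_text.txt instead of the sample string should list the affected words in the whole document.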