Lines merger

What for?

An OCR program such as FineReader recognizes the text in the image and creates a pdf file where the line breaks correspond to the line breaks in the original image. In the hyphenation places, which are located inside the word, the program puts a special character called "soft hyphen".

The program in this repository takes the text extracted from the pdf and stitches the lines together when it sees an intra-word hyphenation.

In addition, since this program is designed to work on a project of digitization of Russian novel, the program removes improbable characters for 19th century Russian text.

How to?

You must extract the text from the pdf file. One way you can do this:

import pdftotext

# Load your PDF
with open("file.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file.
with open('file.txt', 'w') as f:
    f.write("\n====page====\n".join(pdf))

You have to run the script on the command line and set the folder where the txt files are located as a parameter:

python merge.py plain_text

The script will create the same folder with the processed files and add a suffix _merged.

Examples

See source and target.

nevmenandr/lines_merge_OCR

Lines merger

What for?

How to?

Examples