pdfocr

A Python 3.6+ CLI tool that converts PDFs and images to editable text.

python pdfocr.py sample.pdf

Installation

Install Pipenv

For macOS:
```
brew install pipenv
```
For other systems:
```
pip install --user pipenv
```

Install dependencies and activate virtualenv:
```
pipenv install
pipenv shell
```
Install Poppler
- For macOS:
```
brew install poppler
```
- For Ubuntu: Check this gist.

Features

Uses an OCR API service to recognize text in PDFs and images.
Supported file types: pdf, jpg, png, bmp.
Supported languages: English, Chinese, Portuguese, French, German, Italian, Spanish, Russian, Japanese, Korean.

Config

The tool uses an API service to perform OCR.

Currently it uses Baidu's API with my built-in API key.

If the built-in API key's quota runs out (you will see an error message like request limit reached), you can either:

- Set the CLI option --accurate to False (the less accurate API has a larger quota, but it can still run out).
- Or use your own API key (it's free). Go to Baidu Cloud to apply for an API key, then fill in the APP_ID, API_KEY and SECRET_KEY fields in ocr/baidu_ocr.py.
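
As a rough illustration of the second option, the credential fields might look something like the sketch below. Only the names APP_ID, API_KEY and SECRET_KEY come from this README; the empty placeholder values and the `missing_credentials` helper are hypothetical and are not part of ocr/baidu_ocr.py as shipped.

```python
# Hypothetical excerpt in the spirit of ocr/baidu_ocr.py.
# Replace the empty strings with the values issued by Baidu Cloud.
APP_ID = ""
API_KEY = ""
SECRET_KEY = ""


def missing_credentials():
    """Return the names of credential fields that are still empty.

    Illustrative helper only (an assumption, not part of the tool).
    """
    fields = {"APP_ID": APP_ID, "API_KEY": API_KEY, "SECRET_KEY": SECRET_KEY}
    return [name for name, value in fields.items() if not value.strip()]
```

A check like this could be run before calling the API so that an unfilled field fails early with a clear message instead of a remote authentication error.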

Usage

pdfocr.py --i I [--o O] [--lang LANG] [--accurate ACCURATE]

--i:        (Required) The input file path. Supports pdf, jpg, png, bmp.
--o:        (Optional) The output file path. Defaults to input_file_name.txt in the current directory.
--lang:     (Optional) Use one of the following: 'ENG' (default), 'CHN_ENG', 'POR', 'FRE', 'GER', 'ITA', 'SPA', 'RUS', 'JAP', 'KOR'.
--accurate: (Optional) Whether to use the accurate OCR API. Default is True.

A full command would be:

python pdfocr.py --i=sample.pdf --o=sample.txt --lang=ENG --accurate=True
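
For readers curious how options like these can be handled, here is a minimal sketch using the standard argparse module. It is not the actual parsing code in pdfocr.py. The helper names (`str_to_bool`, `build_parser`, `default_output_path`, `parse_args`) are made up for illustration, and the sketch assumes the default output name drops the input file's extension.

```python
import argparse
import os

# Language codes listed in the Usage section of this README.
LANGUAGES = ["ENG", "CHN_ENG", "POR", "FRE", "GER", "ITA", "SPA", "RUS", "JAP", "KOR"]


def str_to_bool(value):
    """Convert 'True'/'False' style text, since --accurate arrives as a string."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected True or False, got %r" % value)


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdfocr.py")
    parser.add_argument("--i", required=True, help="input file path (pdf, jpg, png, bmp)")
    parser.add_argument("--o", help="output file path")
    parser.add_argument("--lang", default="ENG", choices=LANGUAGES)
    parser.add_argument("--accurate", type=str_to_bool, default=True)
    return parser


def default_output_path(input_path):
    """Return <input file name>.txt in the current directory (assumed behavior)."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return stem + ".txt"


def parse_args(argv=None):
    """Parse argv and fill in the default output path when --o is omitted."""
    args = build_parser().parse_args(argv)
    if args.o is None:
        args.o = default_output_path(args.i)
    return args
```

Passing the text of --accurate through a converter matters because a plain `type=bool` would treat any non-empty string, including "False", as true.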

Known Issues