A Python 3.6+ CLI tool that converts PDFs and images to editable text.
python pdfocr.py --i=sample.pdf
- Install Pipenv
  - For macOS:
    brew install pipenv
  - For other systems:
    pip install --user pipenv
- Install dependencies and activate virtualenv:
  pipenv install
  pipenv shell
- Install Poppler
  - For macOS:
    brew install poppler
  - For Ubuntu: Check this gist.
- Uses an OCR API service to identify text in PDFs and images.
- Supported file types: pdf, jpg, png, bmp.
- Supported languages: English, Chinese, Portuguese, French, German, Italian, Spanish, Russian, Japanese, Korean.
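As a rough illustration of the lists above, the following Python sketch checks whether a file has a supported extension and maps a language name to the code passed to --lang. The names SUPPORTED_EXTENSIONS, LANGUAGE_CODES, is_supported_file and language_code are hypothetical and are not taken from the tool's source.

```python
from pathlib import Path

# Hypothetical names for illustration; not taken from pdfocr's source.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".png", ".bmp"}

# Language names from the feature list, mapped to the --lang codes.
LANGUAGE_CODES = {
    "English": "ENG",
    "Chinese": "CHN_ENG",
    "Portuguese": "POR",
    "French": "FRE",
    "German": "GER",
    "Italian": "ITA",
    "Spanish": "SPA",
    "Russian": "RUS",
    "Japanese": "JAP",
    "Korean": "KOR",
}


def is_supported_file(path):
    """Return True if the file extension is one of the supported types."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def language_code(name):
    """Return the --lang code for a language name, or raise ValueError."""
    try:
        return LANGUAGE_CODES[name]
    except KeyError:
        raise ValueError(f"Unsupported language: {name}") from None
```

The extension check is case-insensitive, so sample.PDF is treated the same as sample.pdf. Whether the tool itself does this is not stated here.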
The tool uses an API service to perform OCR.
Currently it uses Baidu's API, with my API key.
If the built-in API key's quota runs out (you get an error message like "request limit reached"), you can either:
- Set the CLI option accurate to False (the less accurate version of the API has a larger quota, but it can still run out).
- Or use your own API key (it's free). Go to Baidu Cloud to apply for your API key, then fill in the APP_ID, API_KEY and SECRET_KEY fields in ocr/baidu_ocr.py.
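If you call the OCR code from your own script, the first option can be automated along the following lines. This is only a sketch under assumptions: QuotaExceededError, ocr_with_fallback and the recognize callable are hypothetical names, and the tool may not expose its Baidu client in this form. demo_recognize is a stand-in that simulates an exhausted quota on the accurate API.

```python
class QuotaExceededError(Exception):
    """Hypothetical error for a 'request limit reached' response."""


def ocr_with_fallback(recognize, image_bytes):
    """Try the accurate API first, then the less accurate one on quota errors.

    `recognize` is a hypothetical callable with the signature
    recognize(image_bytes, accurate) -> str.
    """
    try:
        return recognize(image_bytes, accurate=True)
    except QuotaExceededError:
        # The less accurate API has more quota but can run out as well;
        # in that case this second call raises and the error propagates.
        return recognize(image_bytes, accurate=False)


def demo_recognize(image_bytes, accurate):
    """Stand-in for a real OCR call: the accurate API is out of quota."""
    if accurate:
        raise QuotaExceededError("request limit reached")
    return "text from the less accurate API"
```

If both calls fail, the error reaches the caller, which is the point at which switching to your own API key is the remaining option.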
pdfocr.py --i I [--o O] [--lang LANG] [--accurate ACCURATE]
--i: (Required) The input file path. Supports pdf, jpg, png, bmp.
--o: (Optional) The output file path. By default it is input_file_name.txt in the current directory.
--lang: (Optional) Use one of the following: 'ENG' (default), 'CHN_ENG', 'POR', 'FRE', 'GER', 'ITA', 'SPA', 'RUS', 'JAP', 'KOR'.
--accurate: (Optional) Whether to use the accurate OCR API. Default is True.
A full command would be:
python pdfocr.py --i=sample.pdf --o=sample.txt --lang=ENG --accurate=True
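For readers curious how options like these can be declared, here is a minimal argparse sketch. It is not the tool's actual parser: build_parser, parse_args, str_to_bool and default_output_path are hypothetical names, and the handling of --accurate is an assumption. A custom converter is used because argparse's type=bool would treat the string "False" as True.

```python
import argparse
import os

LANG_CHOICES = ["ENG", "CHN_ENG", "POR", "FRE", "GER",
                "ITA", "SPA", "RUS", "JAP", "KOR"]


def str_to_bool(value):
    """Convert 'True'/'False' style strings to a bool."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"Expected True or False, got {value!r}")


def default_output_path(input_path):
    """Return input_file_name.txt, relative to the current directory."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return stem + ".txt"


def build_parser():
    parser = argparse.ArgumentParser(prog="pdfocr.py")
    parser.add_argument("--i", required=True,
                        help="Input file path (pdf, jpg, png, bmp).")
    parser.add_argument("--o", help="Output file path.")
    parser.add_argument("--lang", default="ENG", choices=LANG_CHOICES,
                        help="OCR language code.")
    parser.add_argument("--accurate", type=str_to_bool, default=True,
                        help="Whether to use the accurate OCR API.")
    return parser


def parse_args(argv=None):
    """Parse arguments and fill in the default output path if --o is absent."""
    args = build_parser().parse_args(argv)
    if args.o is None:
        args.o = default_output_path(args.i)
    return args
```

With this sketch, parse_args(["--i=sample.pdf", "--accurate=False"]) yields lang "ENG", output path "sample.txt" and accurate set to False.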
- The OCR result might not be 100% correct.
- The French OCR result does not contain accent marks (e.g., é, è).