Note: This repository is the official home for the bangla-pdf-ocr package. Use it for package-related issues, discussions, and contributions. We welcome your feedback and involvement in improving the tool!
Bangla PDF OCR extracts Bengali text from PDF files. It works on Windows, macOS, and Linux, and its setup command installs the required OCR dependencies for you. The tool was originally developed as part of the Bangla RAG (Retrieval-Augmented Generation) pipeline project to support the PoRAG system, but it also works as a standalone tool for Bengali OCR tasks.
- Extracts Bengali text from PDFs quickly and accurately
- Works on Windows, macOS, and Linux
- Easy to use from both command line and Python scripts
- Installs all necessary components automatically
- Supports other languages besides Bengali
- Multi-threaded processing for improved performance
- Install the package:
pip install bangla-pdf-ocr
- Run the setup command to install dependencies:
bangla-pdf-ocr-setup
- Start using it right away!
From command line:
bangla-pdf-ocr your_file.pdf
In your Python script:
from bangla_pdf_ocr import process_pdf

path = "path/to/your/pdf_file.pdf"
output_file = "output.txt"
extracted_text = process_pdf(path, output_file)
print(extracted_text)
That's it! No additional downloads or configurations needed.
- Python 3.6 or higher
- pip (Python package installer)
- Install the package from PyPI:
pip install bangla-pdf-ocr
- Set up system dependencies:
bangla-pdf-ocr-setup
This command installs necessary dependencies based on your operating system:
- Linux: Installs tesseract-ocr, poppler-utils, and tesseract-ocr-ben
- macOS: Installs tesseract, poppler, and tesseract-lang via Homebrew
- Windows: Downloads and installs Tesseract OCR and Poppler, adding them to the system PATH
Note: On Windows, you may need to run the command prompt as administrator.
- Verify the installation:
bangla-pdf-ocr-verify
This command checks if all required dependencies are properly installed and accessible.
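If you want to run a similar check from your own script, the sketch below shows one way to do it. It is an illustration only and does not reproduce what bangla-pdf-ocr-verify does internally. It assumes the dependencies are exposed as the `tesseract` and `pdftoppm` (part of Poppler) executables on your PATH; the function name `check_ocr_dependencies` is hypothetical.

```python
import shutil
import subprocess


def check_ocr_dependencies(language="ben"):
    """Report whether the external OCR tools appear to be usable.

    Illustrative helper, not part of bangla-pdf-ocr.
    """
    report = {
        # Tesseract's command-line binary
        "tesseract": shutil.which("tesseract") is not None,
        # pdftoppm ships with Poppler and renders PDF pages to images
        "pdftoppm": shutil.which("pdftoppm") is not None,
        # Set to True below if the requested language data is listed
        "language_data": False,
    }
    if report["tesseract"]:
        try:
            result = subprocess.run(
                ["tesseract", "--list-langs"],
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
                timeout=30,
            )
        except (OSError, subprocess.SubprocessError):
            return report
        # Some Tesseract versions print the list to stderr, so check both
        listed = (result.stdout + result.stderr).split()
        report["language_data"] = language in listed
    return report


if __name__ == "__main__":
    for name, ok in check_ocr_dependencies().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A missing entry usually means the setup command did not finish or the PATH change has not been picked up yet.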
Basic usage:
bangla-pdf-ocr [input_pdf] [-o output_file] [-l language]
- input_pdf: Path to the input PDF file (optional; a sample PDF is used if not provided)
- -o, --output: Output file path (default: the input filename with a .txt extension)
- -l, --language: OCR language (default: 'ben' for Bengali)
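To make the argument rules above concrete, here is a minimal argparse sketch that mirrors the documented interface. It is not the package's actual source; `build_parser` and `resolve_output` are hypothetical names, and the handling of the sample PDF is deliberately left out.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(prog="bangla-pdf-ocr")
    # Optional positional argument: the real tool falls back to a sample PDF
    parser.add_argument("input_pdf", nargs="?", default=None,
                        help="path to the input PDF file")
    parser.add_argument("-o", "--output", default=None,
                        help="output file path")
    parser.add_argument("-l", "--language", default="ben",
                        help="OCR language code (default: ben)")
    return parser


def resolve_output(args):
    """Apply the documented default: input filename with a .txt extension."""
    if args.output is not None:
        return Path(args.output)
    if args.input_pdf is None:
        # No input given; the sample-PDF case is not modelled in this sketch
        return None
    return Path(args.input_pdf).with_suffix(".txt")


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print(parsed.input_pdf, resolve_output(parsed), parsed.language)
```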
- Process the default sample PDF:
bangla-pdf-ocr
- Process a specific PDF:
bangla-pdf-ocr path/to/my_document.pdf
- Specify an output file:
bangla-pdf-ocr path/to/my_document.pdf -o path/to/extracted_text.txt
- Try a sample PDF extraction:
bangla-pdf-ocr
This command processes a sample Bengali PDF file included with the package, demonstrating the text extraction capabilities.
You can also use Bangla PDF OCR as a module in your Python scripts:
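from bangla_pdf_ocr import process_pdf

# Forward slashes work on every platform and avoid invalid escape sequences such as "\d"
path = "bangla_pdf_ocr/data/Freedom Fight.pdf"
output_file = "Extracted_text.txt"

# Run OCR on the PDF and save the extracted text to output_file
extracted_text = process_pdf(path, output_file)
print(f"Text extracted and saved to: {output_file}")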
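The same call can be wrapped in a loop to process a whole folder. The sketch below relies only on the `process_pdf(path, output_file)` call shown above; `batch_extract` and its `ocr` parameter are hypothetical helpers added here for illustration, and the `ocr` argument exists so the loop can be tried out with a stand-in function.

```python
from pathlib import Path


def batch_extract(pdf_dir, out_dir, ocr=None):
    """Run OCR on every PDF in pdf_dir and write one .txt file per PDF.

    Returns a dict mapping each PDF filename to its output path,
    or to None if processing that file raised an exception.
    """
    if ocr is None:
        # Imported lazily so the helper can be exercised without the package
        from bangla_pdf_ocr import process_pdf as ocr

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    results = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_path / (pdf.stem + ".txt")
        try:
            ocr(str(pdf), str(target))
            results[pdf.name] = str(target)
        except Exception as exc:
            # Keep going so one bad file does not stop the whole batch
            results[pdf.name] = None
            print(f"Failed on {pdf.name}: {exc}")
    return results
```

For example, `batch_extract("scans", "extracted")` would process every PDF in a `scans` folder; both folder names are placeholders.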
If you encounter any issues:
- Run the verification command:
bangla-pdf-ocr-verify
- For Windows users:
- Run the setup and verify commands from a command prompt opened as administrator if you encounter permission issues.
- Restart your command prompt or IDE after installation to ensure PATH changes take effect.
- Check the console output and logs for any error messages.
- If automatic installation fails, refer to the manual installation instructions provided by the setup command.
- Ensure you have the latest version of the package:
pip install --upgrade bangla-pdf-ocr
- If problems persist, please open an issue on our GitHub repository with detailed information about the error and your system configuration.
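When you open an issue, a short summary of your system configuration helps. The following sketch gathers one using only the Python standard library. `collect_report` is a hypothetical helper, not something the package provides, and it assumes the OCR tools are exposed as `tesseract` and `pdftoppm` on your PATH.

```python
import platform
import shutil
import sys

try:
    from importlib import metadata  # available on Python 3.8+
except ImportError:
    metadata = None


def collect_report():
    """Return a short, plain-text summary of the local environment."""
    lines = [
        f"OS: {platform.system()} {platform.release()}",
        f"Python: {sys.version.split()[0]}",
    ]

    # Look up the installed package version, if it can be determined
    version = "unknown"
    if metadata is not None:
        try:
            version = metadata.version("bangla-pdf-ocr")
        except metadata.PackageNotFoundError:
            version = "not installed"
    lines.append(f"bangla-pdf-ocr: {version}")

    # Show where each external tool resolves on PATH, if anywhere
    for tool in ("tesseract", "pdftoppm"):
        lines.append(f"{tool}: {shutil.which(tool) or 'not on PATH'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(collect_report())
```

Paste the output into the issue together with the error message and the steps that triggered it.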
If you encounter any problems or have suggestions for Bangla PDF OCR:
- Check existing issues to see if your issue has already been reported.
- If not, create a new issue on our GitHub repository.
- Provide detailed information about the problem, including steps to reproduce it.
We appreciate your feedback to help improve Bangla PDF OCR!
Happy OCR processing!