This application extracts text from a receipt image using Optical Character Recognition (OCR) and saves the extracted items and costs to an Excel file, where the data can be further processed or analyzed.
- OCR with Tesseract: Converts receipt images into text using the Tesseract OCR engine.
- Excel Output: Saves the extracted items and costs into an Excel file.
- GUI for File Selection: Uses file dialogs to allow users to select the input image and output Excel file locations.
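
As a rough illustration of the file-selection feature, the dialogs could be opened with the standard `tkinter.filedialog` module. This is a minimal sketch, not the code in `main.py`: the function names `normalize_choice`, `choose_image`, and `choose_output`, the dialog titles, and the file-type filters are all assumptions made for the example.

```python
def normalize_choice(path):
    """Return the chosen path, or None when the dialog was cancelled.

    The tkinter dialogs return an empty value (such as "" or ()) on cancel.
    """
    return path if path else None


def choose_image():
    """Ask the user for the receipt image (hypothetical helper)."""
    # Imported inside the function so normalize_choice works without tkinter.
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window; only the dialog is shown
    path = filedialog.askopenfilename(
        title="Select receipt image",
        filetypes=[("Image files", "*.png *.jpg *.jpeg"), ("All files", "*.*")],
    )
    root.destroy()
    return normalize_choice(path)


def choose_output():
    """Ask the user where to save the Excel file (hypothetical helper)."""
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()
    path = filedialog.asksaveasfilename(
        title="Save Excel file",
        defaultextension=".xlsx",
        filetypes=[("Excel files", "*.xlsx")],
    )
    root.destroy()
    return normalize_choice(path)
```

Returning `None` on cancel lets the caller stop early instead of passing an empty path to the OCR or save step.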
- `main.py`: The main script that handles the entire process, including OCR, parsing, and saving to Excel.
- `requirements.txt`: A list of Python dependencies required to run the project.
- Python 3.x
- Tesseract OCR installed on your system (with `tesseract` in your `PATH`)
- Python libraries:
  - `pytesseract`
  - `Pillow`
  - `pandas`
  - `tkinter` (usually comes pre-installed with Python)
You can install the required Python packages using pip:
pip install -r requirements.txt
Or install them individually:
pip install pytesseract pillow pandas
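
After installing, it can help to confirm that the libraries are importable before running the script. The snippet below is a small sketch using only the standard library; the helper name `missing_modules` is an assumption for this example. Note that the Pillow package is imported under the name `PIL`.

```python
import importlib.util

# Import names for the libraries listed above. Pillow is imported as "PIL".
REQUIRED_MODULES = ["pytesseract", "PIL", "pandas", "tkinter"]


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```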
- Download the Tesseract installer from this link.
- Run the installer and follow the installation instructions.
- Ensure the installation path (e.g., `C:\Program Files\Tesseract-OCR\`) is added to your system's `PATH` environment variable.
If you have Homebrew installed, run:
brew install tesseract
Use your package manager to install Tesseract. For example, on Ubuntu:
sudo apt-get install tesseract-ocr
- Run the Script: Execute the `main.py` script by running `python main.py`. A file dialog will appear, allowing you to select the receipt image file.
- Select the Image File: Choose the image file of the receipt that you want to convert to Excel.
- Process the Receipt: The application will use OCR to extract text from the image and parse it into items and costs.
- Save the Excel File: After processing, another file dialog will prompt you to choose where to save the Excel file.
- Review the Excel File: Open the resulting Excel file to see the extracted items and costs organized in a spreadsheet.
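
The OCR and parsing step described above could look roughly like the following. This is a sketch under assumptions, not the parsing logic in `main.py`: it assumes each item line ends with a price that has two decimals (for example `Thai Jasmine Rice 18.99`), and the names `ocr_image`, `parse_receipt_text`, and `LINE_PATTERN` are made up for this example. The OCR call uses `pytesseract.image_to_string` on an image opened with Pillow.

```python
import re

# Assumed line shape: item text, whitespace, then a price with two decimals.
# Real receipts vary, so this pattern will likely need adjusting.
LINE_PATTERN = re.compile(r"^(?P<item>.+?)\s+\$?(?P<cost>\d+\.\d{2})$")


def ocr_image(image_path):
    """Run Tesseract on the image and return the recognized text."""
    # Imported here so the parser below can be used without these packages.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path))


def parse_receipt_text(text):
    """Turn OCR text into a list of (item, cost) tuples, skipping other lines."""
    rows = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            rows.append((match.group("item"), float(match.group("cost"))))
    return rows


sample = "STORE NAME\nThai Jasmine Rice 18.99\nKS Dog Food 34.99\nThank you!"
print(parse_receipt_text(sample))
# [('Thai Jasmine Rice', 18.99), ('KS Dog Food', 34.99)]
```

With this pattern, summary lines such as a subtotal or total followed by an amount would also match, so a real parser would probably need to filter those out.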
An image of a receipt with items and their costs.
An Excel file with two columns: `Item` and `Cost`.
Item | Cost |
---|---|
AKS Water 4-Pack | 3.99 |
Thai Jasmine Rice | 18.99 |
KS Dog Food | 34.99 |
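
A table like the one above can be written with pandas. The following is a minimal sketch, assuming the parsed data is a list of `(item, cost)` tuples; the function names `rows_to_records` and `save_to_excel` are hypothetical. Writing `.xlsx` files through pandas needs an Excel engine such as `openpyxl`, which is not in the `pip install` line shown earlier and may need to be installed separately.

```python
def rows_to_records(rows):
    """Convert (item, cost) tuples into dicts keyed by the two column names."""
    return [{"Item": item, "Cost": cost} for item, cost in rows]


def save_to_excel(rows, output_path):
    """Write the rows to an .xlsx file with Item and Cost columns."""
    # Imported here so rows_to_records stays usable without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(rows_to_records(rows), columns=["Item", "Cost"])
    # index=False keeps the row numbers out of the spreadsheet.
    frame.to_excel(output_path, index=False)


# Example call (the output file name is a placeholder):
# save_to_excel([("Thai Jasmine Rice", 18.99), ("KS Dog Food", 34.99)], "receipt.xlsx")
```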
If you encounter an error indicating that Tesseract is not found, ensure that Tesseract is correctly installed and added to your system's `PATH`. You can verify the installation by running:
tesseract --version
If Tesseract is not in your `PATH`, you can specify the path in your Python script:
import pytesseract
# Set the tesseract_cmd to the path of your Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
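
The `PATH` check and the manual fallback can also be combined in code. This is a sketch using only the standard library; `find_tesseract` and its parameters are names invented for this example, and the fallback location is the example install path mentioned earlier.

```python
import os
import shutil


def find_tesseract(name="tesseract", fallback_paths=()):
    """Return a path to the executable, or None if it cannot be found."""
    # shutil.which searches the directories listed in PATH.
    found = shutil.which(name)
    if found:
        return found
    # Otherwise try explicit locations supplied by the caller.
    for candidate in fallback_paths:
        if os.path.isfile(candidate):
            return candidate
    return None


# Example use with the Windows path from the snippet above:
# path = find_tesseract(fallback_paths=[r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
# if path:
#     pytesseract.pytesseract.tesseract_cmd = path
```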
This project is licensed under the MIT License.