This project is based on invoice2data, which extracts data from invoice PDF files using user-written regular expressions. In practice, many invoices cannot be extracted with regular expressions alone, because PDF text is organized not as lines of text but as boxes with coordinates. A field value may sit to the right of or below its field name, and line items may be laid out in tables. This solution combines regular expressions with other techniques to handle these cases.
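To make the idea of location-based lookup concrete, here is a minimal sketch in Python. It is not this project's implementation: the `TextBox` structure, the `value_right_of` and `value_below` helpers, the `tolerance` parameter, and the sample coordinates are all hypothetical, and it assumes PDF-style coordinates where y grows upward.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextBox:
    """A piece of PDF text with its bounding box (hypothetical structure)."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge (assumes y grows upward, as in PDF coordinates)
    x1: float  # right edge
    y1: float  # top edge


def _find_label(label: str, boxes: List[TextBox]) -> Optional[TextBox]:
    """Return the first box whose text contains the label, if any."""
    return next((b for b in boxes if label in b.text), None)


def value_right_of(label: str, boxes: List[TextBox],
                   tolerance: float = 3.0) -> Optional[str]:
    """Text of the nearest box to the right of the label, on the same row."""
    anchor = _find_label(label, boxes)
    if anchor is None:
        return None
    mid_y = (anchor.y0 + anchor.y1) / 2
    candidates = [
        b for b in boxes
        if b is not anchor
        and b.x0 >= anchor.x1                           # starts right of the label
        and b.y0 - tolerance <= mid_y <= b.y1 + tolerance  # shares the label's row
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.x0 - anchor.x1).text.strip()


def value_below(label: str, boxes: List[TextBox]) -> Optional[str]:
    """Text of the nearest box below the label that overlaps it horizontally."""
    anchor = _find_label(label, boxes)
    if anchor is None:
        return None
    candidates = [
        b for b in boxes
        if b is not anchor
        and b.y1 <= anchor.y0                          # lies entirely below the label
        and b.x0 <= anchor.x1 and b.x1 >= anchor.x0    # horizontal overlap
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b.y1).text.strip()


# Made-up sample boxes for illustration only.
boxes = [
    TextBox("Invoice No:", 50, 700, 120, 712),
    TextBox("INV-001", 130, 700, 180, 712),
    TextBox("Date", 300, 700, 330, 712),
    TextBox("2019-01-15", 300, 680, 360, 692),
]
invoice_no = value_right_of("Invoice No", boxes)
invoice_date = value_below("Date", boxes)
```

A regular expression over a flat text dump could easily miss the date in this sample, since the value sits on a different row from its label; comparing box coordinates finds it directly.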
The solution is written in Python and provides the following capabilities:
- Extract field values (regex-based or location-based: right, bottom ...)
- Table-based line item extraction (regex, vertical or horizontal table lines ...)
- Check whether the sum of the extracted line item totals matches the invoice's sum total.
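- Requires Python 3
- Install the following libraries with pip/conda: pdfminer, sortedcontainers, sets, pluginbase. The other modules used (json, re, copy for deepcopy, configparser, logging, argparse) are part of the Python 3 standard library and need no separate installation.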
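The total check listed among the capabilities above can be sketched as follows. This is an illustrative sketch, not the project's code: the `totals_match` function, its `tolerance` parameter, and the assumption that amounts arrive as strings with "," as the thousands separator are all hypothetical.

```python
from decimal import Decimal
from typing import Iterable


def to_decimal(amount: str) -> Decimal:
    """Parse an amount string, assuming ',' is a thousands separator."""
    return Decimal(amount.replace(",", "").strip())


def totals_match(line_totals: Iterable[str], invoice_total: str,
                 tolerance: str = "0.01") -> bool:
    """True if the line item totals add up to the invoice total.

    Decimal is used instead of float so that currency amounts are
    summed without binary rounding errors.
    """
    summed = sum((to_decimal(v) for v in line_totals), Decimal("0"))
    return abs(summed - to_decimal(invoice_total)) <= Decimal(tolerance)


ok = totals_match(["1,200.00", "349.50"], "1,549.50")
mismatch = totals_match(["10.00", "5.00"], "16.00")
```

A mismatch usually signals that a line item was missed or misread during extraction, which is why the check is useful as a sanity test on the output.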
Templates, written in JSON, tell the application how to pick field values (regex, top or bottom of the field name) and how line items are laid out in the invoice (with or without horizontal and vertical lines, line item columns…).
Look at the data folder, which contains JPG images of a few invoices (the originals are native PDFs). The corresponding templates are in the template folder. Reading a JPG and its template JSON side by side makes the template contents easy to understand.
Refer to doc/invoice-extractor-checker.pdf for detailed documentation.
From the src folder, run: python Main.py --dump <dump file> --template <template folder> --file <invoice pdf> --output <output json>
E.g., python Main.py --dump dump.txt --template ../template/ --file "../data/Amazon-Storeji.pdf" --output "../output/Amazon-Storeji.json"
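To process several invoices in one go, the command above could be wrapped in a small script. The sketch below is an assumption-laden convenience, not part of the project: `build_command` and `run_batch` are hypothetical helpers, it uses only the four flags shown above, and it assumes it is run from the src folder with one output JSON named after each PDF.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(pdf: Path, template_dir: str, output_dir: Path,
                  dump_file: str = "dump.txt") -> List[str]:
    """Build the Main.py command line for one invoice PDF."""
    output_json = output_dir / (pdf.stem + ".json")
    return [
        sys.executable, "Main.py",
        "--dump", dump_file,
        "--template", template_dir,
        "--file", str(pdf),
        "--output", str(output_json),
    ]


def run_batch(data_dir: str, template_dir: str, output_dir: str) -> None:
    """Run Main.py once per PDF found in data_dir (hypothetical helper)."""
    for pdf in sorted(Path(data_dir).glob("*.pdf")):
        # check=True stops the batch at the first invoice that fails.
        subprocess.run(build_command(pdf, template_dir, Path(output_dir)),
                       check=True)


# Example call, using the folder layout from the command above:
# run_batch("../data", "../template/", "../output")
```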
- Simple User Interface to edit template files.
- Docker image.
- Multi page/Multi header occurrences/Multi header pdfs.
- Improve table line item extraction.
- Run on more invoice samples/new formats.