This is an example of a receipt classifier using the LinearSVC supervised learning algorithm. The application uses Tesseract OCR to scan receipt images for text, and uses the occurrence each unicode character code as a parameter in LinearSVC.
- algorithm: the algorithm to use, one of LinearSVM, GaussianNB, RandomForest, SGD
- penalty: the penalty parameter passed to LinearSVM, which reduces overfitting as it decreases
- tolerance: the algorithm stopping tolerance passed to LinearSVM, which trades computation time for better accuracy as it approaches 0
- min_n_gram / max_n_gram: the min / max n to use in counting features, which treats all n consecutive characters as a feature for min_n_gram <= n <= max_n_gram
- token_pattern: the regex to use in selecting words, defaults to single characters
To train the application, set the training_data
value in the config.json
to an array of [glob, classification]
values. Glob is a glob of image files to parse, and classification is 1 if the glob is a set of taxi receipts, 0 otherwise.
To test the accuracy of the training, configure a test data set similar to the training data set above by setting the test_data
value in the config.json
.
To classify a receipt or set of receipts, call main.py
with a file or glob.
- Tesseract OCR 3.2 or greater
- numpy
- scipy
- scikit-learn
This is technically an implementation of both n-gram and bag of words. Both n in the n-gram algorithm and the word regex token can be configured in the config.json. The default is n = 1, word = /\S/.
The application also supports several other algorithms in the scikit learn module: Gaussian Naive-Bayes, Stochastic Gradient Descent, Random Forest. However, these were added mostly for quick testing. They have not been configured or optimized, and the results were vastly inferior to LinearSVM, as expected.