Business Card OCR


Table of Contents

  1. Preface
  2. Quickstart
  3. Engineering and Design

Preface

This project was written in Python 3.6. The application can be run either standalone in a terminal or inside a Docker container. The container option exists because installing Python 3.6 can be cumbersome for a tester, and with the growth of containerization it is a viable way to present and run the project.

Tool Stack

The project uses:

  • Python 3.6 - Python Interpreter

    • PyYAML - a Python YAML parser (used solely for testing)
  • Docker - Container Utility

    • CentOS 7 - Base Container Image
    • dumb-init - Container initialization system
    • Python 3.6 and dependencies (above)

Platform Compliance

The project has been tested on:

  • Mac OS X
  • CentOS 7 (containerized)

Application Help

usage: __main__.py [-h] [-d DOCUMENT] [--test]

Simple Business Card OCR

optional arguments:
  -h, --help            show this help message and exit
  -d DOCUMENT, --document DOCUMENT
                        pass the document string to be parsed
  --test                run test cases
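
A minimal argparse sketch that would produce options matching the help text above. This is an illustration only; the function name build_parser is hypothetical and the project's actual __main__.py may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the help text above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Simple Business Card OCR")
    parser.add_argument(
        "-d", "--document",
        help="pass the document string to be parsed",
    )
    parser.add_argument(
        "--test", action="store_true",
        help="run test cases",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(
    ["--document", "Mike Smith\nmsmith@asymmetrik.com"]
)
print(args.document.splitlines())
```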

Quickstart

So that no one has to modify their development environment to test or run this application, a Dockerfile is provided. All commands should be executed in a Bash terminal from the root directory of the BusinessCard-OCR project. The required dev tools are listed in the Tool Stack (under Preface).

Docker Usage

Requires: Docker

# Build docker image
docker build -t ocr .

# Run test suite
docker run ocr

# Run sample document string
docker run ocr --document $'ASYMMETRIK LTD\nMike Smith\nSenior Software Engineer\n(410)555-1234\nmsmith@asymmetrik.com'

# Access help menu
docker run ocr -h

Terminal Usage

Requires: Python 3.6, pip3.6 (default with Python)

# Install PyPI dependencies
pip3.6 install -r requirements.txt

# Run test suite
python3.6 -m ocr --test

# Run sample document string
python3.6 -m ocr --document $'ASYMMETRIK LTD\nMike Smith\nSenior Software Engineer\n(410)555-1234\nmsmith@asymmetrik.com'

# Access help menu
python3.6 -m ocr -h

Notes:

  • The above --document flag examples use an ANSI-C quoted string ($'...') containing the first example input provided at the challenges site; this string type is needed so that Bash expands the escaped characters (such as \n) into real newlines before passing the string to the application.

  • Executing docker run without specifying the --name flag gives the container an auto-generated name

  • Using the docker run commands with -d, prior to the image name, runs the container detached (in the background); its output can be viewed later with docker logs


Engineering and Design

This section will review the design choices made, road blocks hit, and solutions developed when working on this project.

Code Implementation

For the actual implementation of the OCR, a single Python module (ocr/parser.py) worked well; having both the BusinessCardParser and _ContactInfo classes in one file led to easier comprehension of the implementation. Compliance with PEP 8 was key throughout the project.

  • Class: BusinessCardParser

    • Has one "public" accessor method: get_contact_info()
    • Returns a _ContactInfo object from the get_contact_info() method
    • _filter_fields() method uses the _ContactInfo.__slots__ attributes as a search reference
    • _filter_fields() method returns a dictionary purpose built to populate a _ContactInfo object
    • The fields email and phone are parsed independently, while the name field requires a populated email field
  • Class: _ContactInfo

    • Has three "public" accessor methods: get_name(), get_phone_number(), and get_email_address()
    • Uses the __slots__ attribute to minimize memory profile
    • **kwargs is explicitly used to populate the object on initialization
    • All variable attributes are name mangled so that they cannot be accessed directly
    • Getters all call str() on their respective attributes, as a safeguard in case an attribute's type ever changes from str

Parsing Algorithm

The initial checks were quick to implement for both email and phone; name parsing, however, was more of a challenge. The implementations discussed below can be found in the method _filter_fields() under the class BusinessCardParser.

Initial Assumptions

Based on the example data sets, the input is expected to contain few, if any, oddities, given the professional nature of the medium (business cards); the data sets are also assumed to be well formed.

Email Address

Simple research into email parsing led to the site emailregex.com, which had the exact regex string to use in Python for an exceptionally accurate match rate. Each line is first checked for an "@" symbol, followed by an assessment that a "." is in the same token; only then is the string validated with the regex.
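
The email check described above might look like the sketch below. The pattern is written in the style of the simple pattern published on emailregex.com; the exact string the project uses is not reproduced in this README, so treat both the pattern and the function name find_email as assumptions.

```python
import re

# Assumption: a simple email pattern of the kind published on emailregex.com.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$")


def find_email(line):
    """Return the first token in the line that validates as an email, else None."""
    if "@" not in line:  # cheap pre-check on the whole line
        return None
    for token in line.split():
        # The "." must be in the same token as the "@" before the regex runs.
        if "@" in token and "." in token and EMAIL_PATTERN.match(token):
            return token
    return None


print(find_email("msmith@asymmetrik.com"))
```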

Phone Number

What stood out from the provided example data sets was that more than one number on a card could qualify as a telephone number (for example, a fax number). Some research turned up E.164, an international telecom numbering standard, which caps phone numbers at 15 digits but sets no minimum. Based on the American standard, a line is treated as a phone number if it contains more than 6 digits and "fax" is not found in the same line. If multiple phone numbers are provided, such as "cell" and "office", only the last one processed is stored.
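
A rough sketch of the phone rule described above. The function name find_phone, returning only the digits, the case-insensitive "fax" check, and enforcing the 15-digit E.164 ceiling are assumptions made for illustration; the README only states the "more than 6 digits, no fax" rule.

```python
def find_phone(lines):
    """Return the digits of the last qualifying phone line, or None."""
    phone = None
    for line in lines:
        if "fax" in line.lower():
            continue  # fax numbers are skipped
        digits = "".join(ch for ch in line if ch.isdigit())
        # More than 6 digits per the rule above; the upper bound of 15
        # (E.164 maximum) is an assumption of this sketch.
        if 6 < len(digits) <= 15:
            phone = digits  # later lines overwrite earlier ones
    return phone


print(find_phone(["Mike Smith", "(410)555-1234", "Fax: (410)555-4321"]))
```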

Name

Attempting to extract the name field independently of the other data points proved difficult. Research led to an article on Named Entity Recognition (NER), a problem within Natural Language Processing; this continued into experiments with Python's nltk (Natural Language Toolkit) and Stanford's CoreNLP server. Passing full documents into the CoreNLP server returned proper entity sets up until the "Arthur Wilson" (third example input) set. Reassessing the data set as a whole led to the realization that the username segment of the email field could be used to validate the name. To view the code prior to the current implementation, which used the CoreNLP server and entity tagging, go to commit c37bf28.
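
One way the email username could be used to validate a name line is sketched below. The README does not spell out the exact matching rule, so the substring test, the function name find_name, and the token filtering are all assumptions for illustration.

```python
def find_name(lines, email):
    """Return the first line with a word that appears in the email username."""
    username = email.split("@", 1)[0].lower()
    for line in lines:
        if "@" in line:
            continue  # skip the email line itself
        # Keep purely alphabetic words longer than one character.
        tokens = [t.lower() for t in line.split() if t.isalpha() and len(t) > 1]
        # Assumed rule: any word of the line occurring inside the username.
        if any(token in username for token in tokens):
            return line.strip()
    return None


card = [
    "ASYMMETRIK LTD",
    "Mike Smith",
    "Senior Software Engineer",
    "(410)555-1234",
    "msmith@asymmetrik.com",
]
print(find_name(card, "msmith@asymmetrik.com"))
```

Here "smith" occurs inside the username "msmith", so the "Mike Smith" line is selected; this dependency is why the name field requires a populated email field.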

Open Source Tools

The only open source library used in the Python code is PyYAML; it is used solely for parsing the YAML files in /tests/samples/. The regex string provided by emailregex.com is used to validate extracted emails in the BusinessCardParser class. The Dockerfile is built on CentOS 7 with a library called dumb-init to allow for cleaner process and signal handling.

Test Cases

Two sample files are provided: asymmetrik.yaml, which contains the three examples provided at the challenges site, and custom.yaml, with three examples containing obfuscated data and formats modeled on real business cards. Each YAML file consists of multiple top-level examples, each with a document attribute and an output attribute, both multi-line strings. The test_parser module in /tests loads these files in their respective methods (test_asymmetrik(), test_custom()), then uses a BusinessCardParser object to process each example's document and compare the result to the example's expected output.