NielsRogge/Transformers-Tutorials

Suggestion for better way to apply OCR on table cells

abielr opened this issue · 1 comment

Hi @NielsRogge, thank you for your recent notebook on table detection and structure recognition. In the notebook you apply OCR at the end, cell by cell. Because the table structure recognition model will sometimes draw the row and column dividing lines in a way that slightly clips the text in a cell, the OCR is often applied to partially cut-off text and gives poor results. A better way is to first apply OCR on the entire image, get the bounding box for each piece of text, and then decide which cell that text belongs to by finding the cell that has the maximum overlap with the text's bounding box.
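
To make the maximum-overlap assignment concrete, here is a minimal sketch; the helper names and the token/cell dict layout (a 'bbox' list and a 'text' string) are my own and only illustrate the idea, not code from the notebook or from table-transformer:

def bbox_intersection_area(a, b):
    # Area of the intersection of two [x1, y1, x2, y2] boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    if x2 <= x1 or y2 <= y1:
        return 0.0
    return (x2 - x1) * (y2 - y1)

def assign_tokens_to_cells(tokens, cells):
    # tokens: list of dicts with 'bbox' ([x1, y1, x2, y2]) and 'text'
    # cells:  list of dicts with 'bbox', e.g. the cells from structure recognition
    # Returns one text string per cell, built from the tokens that overlap that cell most
    cell_texts = [[] for _ in cells]
    for token in tokens:
        overlaps = [bbox_intersection_area(token['bbox'], cell['bbox']) for cell in cells]
        best = max(range(len(cells)), key=lambda i: overlaps[i])
        if overlaps[best] > 0:  # skip tokens that do not overlap any cell
            cell_texts[best].append(token['text'])
    return [' '.join(texts) for texts in cell_texts]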

Fortunately, the table-transformer repo already has code for all of the processing after the OCR step in its inference.py file; see in particular the TableExtractionPipeline.recognize() function. For the OCR step, the Unstructured project has examples here that demonstrate OCR with both Tesseract and Paddle, and these could easily be adapted to other OCR libraries. Also, if the table in your PDF is not an embedded image, then an alternative to OCR is to extract the text directly with a PDF library, which may be faster and more accurate than OCR. The PDF library will return the bounding boxes of the text relative to the page (assuming 72 DPI), after which you need to translate the coordinates to align with the image of the page, keep only the text that sits within the detected tables, and then shift the coordinates of that text so they are relative to the upper left corner of each table. I couldn't find an example of this in any existing library, but I demonstrate the approach in the sample code below.

Below is some code with the key functions. Credit to the Unstructured project for the OCR functions, which I've tweaked slightly to make them standalone for this example.


# Assume the key functions from the table-transformer inference.py file have been imported in a module called `utils`.
# Then, after getting a `tokens` list from one of the functions farther below, you could run the following code
# once table structure recognition has produced `objects`. See the TableExtractionPipeline.recognize() function in
# https://github.com/microsoft/table-transformer/blob/main/src/inference.py for more detail.

tables_structure = utils.objects_to_structures(objects, tokens, utils.structure_class_thresholds)
tables_cells = [utils.structure_to_cells(structure, tokens)[0] for structure in tables_structure]
tables_htmls = [utils.cells_to_html(cells) for cells in tables_cells]
tables_csvs = [utils.cells_to_csv(cells) for cells in tables_cells]

# Functions for doing OCR or PDF text extraction 
import pandas as pd
from paddleocr import PaddleOCR
import pytesseract
from pytesseract import Output
from PIL import Image
import PIL
import numpy as np
import fitz

def parse_ocr_data_paddle(ocr_data: list) -> list:
    text_regions = []
    for idx in range(len(ocr_data)):
        res = ocr_data[idx]
        if not res:
            continue

        for line in res:
            x1 = min([i[0] for i in line[0]])
            y1 = min([i[1] for i in line[0]])
            x2 = max([i[0] for i in line[0]])
            y2 = max([i[1] for i in line[0]])
            text = line[1][0]
            if not text:
                continue
            cleaned_text = text.strip()
            if cleaned_text:
                text_region = dict(
                    bbox=[x1, y1, x2, y2],
                    text=cleaned_text
                )
                text_regions.append(text_region)

    return text_regions

def parse_ocr_data_tesseract(ocr_data: pd.DataFrame) -> list:
    zoom = 1  # no coordinate rescaling applied here (kept for parity with the original Unstructured code)

    text_regions = []
    for idtx in ocr_data.itertuples():
        text = idtx.text
        if not text:
            continue

        cleaned_text = str(text) if not isinstance(text, str) else text.strip()

        if cleaned_text:
            x1 = idtx.left / zoom
            y1 = idtx.top / zoom
            x2 = (idtx.left + idtx.width) / zoom
            y2 = (idtx.top + idtx.height) / zoom
            text_region = dict(
                bbox=[x1, y1, x2, y2],
                text=cleaned_text
            )
            text_regions.append(text_region)

    return text_regions

def get_ocr_layout_paddle(image: Image.Image) -> list[dict]:
    # TODO: Make it so PaddleOCR() is only called once
    paddle_ocr = PaddleOCR(
        use_angle_cls=True,
        lang='en',
        enable_mkldnn=False
    )

    tokens = parse_ocr_data_paddle(paddle_ocr.ocr(np.array(image), cls=True))
    for idx, token in enumerate(tokens):
        if "span_num" not in token:
            token["span_num"] = idx
        if "line_num" not in token:
            token["line_num"] = 0
        if "block_num" not in token:
            token["block_num"] = 0
    return tokens

def get_ocr_layout_tesseract(image: Image.Image) -> list[dict]:
    tokens = pytesseract.image_to_data(image, output_type=Output.DATAFRAME, lang='eng')
    tokens = tokens.dropna()
    tokens = parse_ocr_data_tesseract(tokens)
    for idx, token in enumerate(tokens):
        if "span_num" not in token:
            token["span_num"] = idx
        if "line_num" not in token:
            token["line_num"] = 0
        if "block_num" not in token:
            token["block_num"] = 0
    return tokens
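
# Note (a suggested usage, not from the notebook): if you use OCR instead of PDF text
# extraction, one way to get table-relative token coordinates is to crop the page image
# to each detected table first and run OCR on the crop, since the crop's coordinate
# system is already relative to the table. `page_image` and `table_objects` are assumed
# to come from the detection step, as in get_layout_pymupdf below.
def get_ocr_layout_per_table(page_image: Image.Image, table_objects: list[dict]) -> list[list[dict]]:
    token_list = []
    for table_object in table_objects:
        crop = page_image.crop(tuple(table_object['bbox']))
        token_list.append(get_ocr_layout_tesseract(crop))  # or get_ocr_layout_paddle(crop)
    return token_list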

def get_layout_pymupdf(page: fitz.Page, page_image: PIL.Image.Image, table_objects: list[dict]) -> list[list[dict]]:
    """
    Extract text from a PDF page and translate the text bounding box coordinates
    to align with the image coordinates of the page, then filter tokens that are
    part of detected tables and update the bounding boxes to be relative to the
    table image coordinates

    Args:
        page (fitz.Page): PyMuPDF PDF page object
        page_image (PIL.Image): Image of the PDF page
        table_objects (list[dict]): A list of detected tables and their bounding boxes, as output by the table-transformers detection model

    Returns:
        list: List of words in each table with bounding boxes whose coordinates are relative to the table image
    """
    words = [list(x) for x in page.get_text('words')]
    # Rescale PDF coordinates to image coordinates
    x_scaling = page_image.size[0] / page.rect[2]
    y_scaling = page_image.size[1] / page.rect[3]
    for word in words:
        word[0] *= x_scaling
        word[1] *= y_scaling
        word[2] *= x_scaling
        word[3] *= y_scaling
    token_list = []
    for table_object in table_objects:
        tokens = []
        x1_, y1_, x2_, y2_ = table_object['bbox']
        for idx, rec in enumerate(words):
            x1, y1, x2, y2, text = rec[:5]
            if x1 >= x1_ and x2 <= x2_ and y1 >= y1_ and y2 <= y2_: # Filter only for words inside the table rectangle
                tokens.append(dict(
                    bbox=[x1-x1_, y1-y1_, x2-x1_, y2-y1_], # Shift from page image coordinates to table image coordinates
                    text=text.strip(),
                    span_num=idx,
                    line_num=0,
                    block_num=0
                ))
        token_list.append(tokens)
    return token_list
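
Finally, a rough sketch of how this might tie back into the snippet at the top of this post; the file name, the `detected_tables` list, and the per-table `objects` from structure recognition are assumptions for illustration only and would come from your own detection and recognition steps:

doc = fitz.open("example.pdf")                      # assumed input PDF
page = doc[0]
pix = page.get_pixmap(dpi=144)                      # render the page at the same size used for detection
page_image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

# detected_tables: output of the table detection model, a list of dicts with a 'bbox'
# in page-image coordinates (assumed to exist)
token_list = get_layout_pymupdf(page, page_image, detected_tables)

# For each table, run structure recognition on the cropped table image to get `objects`
# (not shown here), then reuse the snippet from the top of this post with the matching tokens:
# tables_structure = utils.objects_to_structures(objects, token_list[i], utils.structure_class_thresholds)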

How do you pass the cell coordinates into PaddleOCR and get the CSV output? @abielr