Unstructured-IO/unstructured-api

Chipperv2 outputs incorrect table structure and text

six5532one opened this issue · 1 comment

Describe the bug
Yokebe.pdf
Bebevita.pdf
Mykoforte.pdf

  • Yokebe.pdf: chipperv2 splits the header of the table on page 2 into a separate table, and the text contains multiple errors (possibly OCR), e.g. the "μ" in "μg"
  • Bebivita.pdf: no tables found
  • Mykoforte.pdf: chipperv2 found one table and did not detect others. There are issues in the detected table structure and text.

To Reproduce
See attached documents.
A user used the hosted API with the chipperv2 model. They also tried setting "languages" to "['deu']" and "OCR_AGENT" to "paddle" but noticed no difference. Here is their code:

import requests

unstructured_api_key = '.............' 
unstructured_api_headers = {
    "accept": "application/json",
    "unstructured-api-key": unstructured_api_key
}

unstructured_api_url = "https://api.unstructured.io/general/v0/general"

data = {
    "strategy": "hi_res",
    "pdf_infer_table_structure": "true",
    "hi_res_model_name": "yolox",  # change to "chipperv2" to reproduce
    "languages": "['eng']"
}

file_path = "..............."

# Open the file in a context manager so the handle is closed after the upload.
with open(file_path, 'rb') as f:
    response = requests.post(url=unstructured_api_url,
                             files={'files': f},
                             data=data,
                             headers=unstructured_api_headers)
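
To compare what each model detects, it may help to list the table elements in the parsed response. The following is a minimal sketch, not part of the original report. It assumes the API returns a JSON list of element dicts in which tables carry "type": "Table" and, when table structure inference is enabled, an HTML rendering under metadata "text_as_html". The helper name summarize_tables and the sample data are hypothetical.

```python
from typing import Any


def summarize_tables(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one summary dict per element whose type is "Table"."""
    tables = []
    for element in elements:
        if element.get("type") != "Table":
            continue
        # metadata may be missing; fall back to an empty dict
        metadata = element.get("metadata") or {}
        tables.append({
            "page_number": metadata.get("page_number"),
            "text": element.get("text", ""),
            "html": metadata.get("text_as_html"),
        })
    return tables


# Hypothetical sample standing in for response.json(); not real API output.
sample_elements = [
    {"type": "Title", "text": "Example", "metadata": {"page_number": 1}},
    {"type": "Table", "text": "A B 1 2",
     "metadata": {"page_number": 2, "text_as_html": "<table></table>"}},
]

for table in summarize_tables(sample_elements):
    print(table["page_number"], table["text"])
```

With a real response, summarize_tables(response.json()) would give the number of detected tables per document, which makes a "no tables found" result easy to confirm for each model.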

Closing this because Chipper is only supported in the SaaS API.