Chipperv2 outputs incorrect table structure and text
six5532one opened this issue · 1 comments
six5532one commented
Describe the bug
Yokebe.pdf
Bebevita.pdf
Mykoforte.pdf
- Yokebe.pdf: chipperv2 splits the header of the table on page 2 off into a separate table, and the text contains multiple (OCR?) errors (e.g. the "μ" in "μg").
- Bebivita.pdf: no tables found
- Mykoforte.pdf: chipperv2 found one table and did not detect others. There are issues in the detected table structure and text.
To Reproduce
See attached documents.
A user used the hosted API with the chipperv2 model. They also tried setting "languages" to "['deu']" and "OCR_AGENT" to "paddle" but noticed no difference. Here is their code:
import requests

unstructured_api_key = '.............'
unstructured_api_headers = {
    "accept": "application/json",
    "unstructured-api-key": unstructured_api_key,
}
unstructured_api_url = "https://api.unstructured.io/general/v0/general"
data = {
    "strategy": "hi_res",
    "pdf_infer_table_structure": "true",
    "hi_res_model_name": "yolox",  # change to "chipperv2" to reproduce
    "languages": "['eng']",
}
file_path = "..............."
# Open the PDF in a context manager so the handle is closed after the upload.
with open(file_path, 'rb') as f:
    response = requests.post(url=unstructured_api_url,
                             files={'files': f},
                             data=data,
                             headers=unstructured_api_headers)
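To compare what each model returns for the same file, it may help to summarize the table elements in the response. The following is only a sketch: it assumes the API returns a JSON list of element dicts with a "type" field and a "metadata" dict holding "page_number" and "text_as_html", and the helper name summarize_tables and the sample data are made up for illustration.

```python
from typing import Any


def summarize_tables(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one short summary per element whose type is "Table".

    Assumes each element is a dict shaped like the API's JSON output
    (a "type", a "text", and a "metadata" dict); missing keys are tolerated.
    """
    tables = []
    for el in elements:
        if el.get("type") != "Table":
            continue
        meta = el.get("metadata") or {}
        tables.append({
            "page_number": meta.get("page_number"),
            "has_html": bool(meta.get("text_as_html")),
            "text_preview": (el.get("text") or "")[:60],
        })
    return tables


# Made-up sample standing in for response.json(); not real API output.
sample = [
    {"type": "Title", "text": "Nutrition facts", "metadata": {"page_number": 2}},
    {"type": "Table", "text": "Vitamin D 5 μg",
     "metadata": {"page_number": 2, "text_as_html": "<table></table>"}},
    {"type": "Table", "text": "Header only", "metadata": {"page_number": 2}},
]
summary = summarize_tables(sample)
```

With a real request, summarize_tables(response.json()) could be run once with "yolox" and once with "chipperv2" to see how the number of tables and their pages differ.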
MthwRobinson commented
Closing this because Chipper is only supported in the SaaS API