Unstructured-IO/unstructured-api

Output dir doesn't work for extracted images

Closed · 5 comments

I'm trying to save the extracted images to a directory, but two parameters don't seem to work:

  1. extract_image_block_to_payload: even when set to false, the base64 image still shows up in the response.
  2. extract_image_block_output_dir: I think it is not mapped in the API.

I think offloading the image data to a folder, instead of putting it all in the response, is an important step.

@atmoraes1 Can you provide the code snippet you're trying?

It is a simple request using Postman.

The cURL is as follows:

curl --location 'http://localhost:8000/general/v0/general' \
--form 'files=@"<path-to-file>"' \
--form 'coordinates="false"' \
--form 'languages="por"' \
--form 'pdf_infer_table_structure="true"' \
--form 'extract_image_block_types="[\"Image\"]"' \
--form 'strategy="hi_res"' \
--form 'extract_image_block_to_payload="false"' \
--form 'extract_image_block_output_dir="/tmp"'

I expected the image files to be stored in the /tmp dir, but they are not saved there. Instead, they are returned as base64 in the response.

Right, storing image files is not supported when using the API. You'll need to use the "unstructured" library directly in your code to store image files.

IMHO it doesn't make sense to build a shim just to save images to an output dir as we already have unstructured as an API.

Can I proceed with a new PR for that?

Unfortunately that doesn't quite fit with our design of the server. In most cases, the client making the request isn't on the same host as the server, so passing an output_dir wouldn't give the client access to the files. What we can do here is update the partition_via_api function to make the remote partition call and then pull the images back out of the response. I'll close this and make a new issue.
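
Until something like that lands, a client can do the "pull the images back out of the response" step itself. The following is only a rough sketch, not the planned partition_via_api change: it assumes the JSON response is a list of element dicts whose metadata carries the image under `image_base64` and its type under `image_mime_type`, and the helper name `save_images_from_elements` is made up for illustration.

```python
import base64
import mimetypes
from pathlib import Path


def save_images_from_elements(elements, output_dir):
    """Decode base64 images found in response elements and write them to output_dir.

    `elements` is the parsed JSON response: a list of dicts. The metadata keys
    used below (`image_base64`, `image_mime_type`) are assumptions about the
    response shape, so check them against an actual response.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    saved = []
    for index, element in enumerate(elements):
        metadata = element.get("metadata", {})
        encoded = metadata.get("image_base64")
        if not encoded:
            continue  # not an image element, or no payload was returned

        # Derive a file extension from the MIME type, falling back to ".bin".
        mime_type = metadata.get("image_mime_type", "image/jpeg")
        extension = mimetypes.guess_extension(mime_type) or ".bin"

        # Use the element id as the file name when present.
        name = element.get("element_id", f"image-{index}")
        path = out / f"{name}{extension}"
        path.write_bytes(base64.b64decode(encoded))
        saved.append(path)
    return saved
```

With the request above, this would be called on the parsed response body, for example `save_images_from_elements(response.json(), "/tmp")`. The files then end up on the client's machine rather than on the server host.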