Unstructured-IO/unstructured-api-tools

Ability to accept gzip compressed files

cragwolfe opened this issue · 0 comments

Summary

Currently, when multiple files are passed to an unstructured-api-tools (auto-generated) Pipeline API, the files are presumed to be uncompressed and the files and their content types are passed along to pipeline_api.

However, the consumer of the API should have the ability to submit gzip compressed files as well. See the spec for details. To be clear, this issue is about gracefully handling (potentially) gzip'ed files in the FastAPI interface and passing uncompressed files or the uncompressed text content of files to pipeline_api.
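As a rough illustration of "gracefully handling (potentially) gzip'ed files", here is a minimal sketch using only the Python standard library. The helper name maybe_gunzip is hypothetical and not part of unstructured-api-tools; it assumes the uploaded bytes have already been read from the request, and that detecting gzip by its two leading magic bytes is acceptable (the spec may prescribe a different signal, such as the declared content type).

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def maybe_gunzip(data: bytes) -> bytes:
    """Hypothetical helper: return decompressed bytes if `data` is gzip,
    otherwise return `data` unchanged."""
    if data[:2] != GZIP_MAGIC:
        return data
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        # Starts with the magic bytes but is not a valid gzip stream;
        # treat it as an ordinary uncompressed file.
        return data
```

With something like this in the generated FastAPI layer, pipeline_api would only ever see uncompressed content, regardless of whether a given upload in the request was compressed.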

Note: Ideally #104 is at least partially completed first, but this is not a hard blocker.

Definition of Done

  • Gzipped files may be submitted to the API in either the text_files or files form parameters, per the spec.
  • Unit tests are added, including one for a request that mixes compressed and uncompressed inputs in both text_files and files.
  • Unit tests show the ability to infer the file_content_type to pass to pipeline_api if gz_uncompressed_content_type is not provided.
  • Test instructions demonstrate compressed files being appropriately handled in a locally running pipeline-sec-filings API, including mixed compressed and uncompressed files submitted in the same request.
  • Test instructions demonstrate compressed files being appropriately handled in a locally running pipeline API that accepts files (in contrast to the text_files input in the sections API of pipeline-sec-filings), including mixed compressed and uncompressed files submitted in the same request.
  • Test instructions demonstrate compressed files being appropriately handled in a locally running pipeline API that accepts a file OR a text file (e.g. def pipeline_api(text, file, ...)), including mixed compressed and uncompressed files submitted in the same request.
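The content type inference mentioned in the Definition of Done could look roughly like the sketch below. The function name infer_content_type and its parameters other than gz_uncompressed_content_type are hypothetical, and it assumes gz_uncompressed_content_type arrives as a single optional value; the spec may define it differently. The fallback relies on mimetypes.guess_type from the standard library, which already treats a trailing .gz as an encoding and guesses the type from the rest of the filename.

```python
import mimetypes
from typing import Optional


def infer_content_type(
    filename: str,
    is_gzipped: bool,
    declared_content_type: Optional[str] = None,
    gz_uncompressed_content_type: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: pick the content type to pass to pipeline_api."""
    if not is_gzipped:
        # Uncompressed upload: trust what the client declared.
        return declared_content_type
    if gz_uncompressed_content_type:
        # The client told us what is inside the gzip wrapper.
        return gz_uncompressed_content_type
    # Fall back to guessing from the filename; guess_type returns
    # (type, encoding) and ignores the ".gz" suffix when guessing the type.
    guessed_type, _encoding = mimetypes.guess_type(filename)
    return guessed_type
```

A unit test for the mixed case could then call this once per uploaded file and check that compressed and uncompressed inputs in the same request each resolve to the expected type.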