googleapis/python-documentai

FeatReq: add documentation on the exact format of gcs output files on batch processing

yan-hic opened this issue · 1 comments

One could use more details on how the service processes a batch response:

  1. What is the exact blob name format? I would expect it to be derived from the input URI, so that input and output can be tied together by a shared name, but the only info I found is this comment:
    # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/INPUT_FILE_NUMBER/
  2. The referenced code snippet indicates the service can return multiple JSON files for a given source file. Under what conditions does that happen, and can the output be forced into a single JSON file instead?
  3. Request: the output metadata could return a "file format not supported" error when, e.g., trying to process Excel or Word documents. On that note, supporting such formats would be a great addition to the service, as there are no open-source PDF converters available.

Thanks!

Closing, since for each point:

  1. I decided to move/rename each output file to the source_uri + .json extension.
  2. The output cannot be forced into one file. The service creates a shard (a JSON file in GCS) for every 10 pages.
  3. The metadata returns enough details and a status code.
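
For anyone landing here later, a minimal sketch of the three pieces discussed above, in plain Python with no GCS calls. The helper names (`parse_output_uri`, `expected_shard_count`, `renamed_output_uri`) and the sample URIs are hypothetical, the 10-pages-per-shard figure is taken from this thread rather than from official documentation, and the assumption that each shard sits directly under the INPUT_FILE_NUMBER folder is mine:

```python
import math
import re

# Shard size as reported in this thread; treat it as an assumption, not a guarantee.
PAGES_PER_SHARD = 10

# Matches gs://BUCKET/PREFIX/OPERATION_NUMBER/INPUT_FILE_NUMBER/<blob>
# (assumes the shard blob sits directly under the INPUT_FILE_NUMBER folder).
_OUTPUT_RE = re.compile(
    r"^gs://(?P<bucket>[^/]+)/(?P<prefix>.+)/(?P<operation>[^/]+)"
    r"/(?P<input_index>[^/]+)/(?P<blob>[^/]+)$"
)


def parse_output_uri(uri: str) -> dict:
    """Split an output shard URI into its path components."""
    match = _OUTPUT_RE.match(uri)
    if match is None:
        raise ValueError(f"unexpected output URI layout: {uri}")
    return match.groupdict()


def expected_shard_count(page_count: int) -> int:
    """Number of JSON shards to expect for a document with page_count pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return math.ceil(page_count / PAGES_PER_SHARD)


def renamed_output_uri(source_uri: str) -> str:
    """Target name used in point 1: the source URI plus a .json extension."""
    return source_uri + ".json"


parts = parse_output_uri("gs://my-bucket/out/prefix/123456/0/doc-0.json")
print(parts["operation"], parts["input_index"])
print(expected_shard_count(25))
print(renamed_output_uri("gs://my-bucket/in/report.pdf"))
```

With more than one shard per source file, a single `source_uri + .json` name is not enough on its own; merging the shards or adding a shard suffix is left out of this sketch.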