FeatReq: add documentation on the exact format of gcs output files on batch processing
yan-hic opened this issue · 1 comments
yan-hic commented
One could use more details on how the service processes a batch response:
- what is the exact blob name format? I would expect it to be derived from the input uri, to tie the input and output together by name, but the only info I found is at
- the referenced code snippet indicates the service could return multiple JSON files for a given source file. What are the conditions, and can that be forced to be a single JSON file instead?
- request: the output metadata could return a "file format not supported" status when e.g. trying to process Excel or Word docs. On that note, supporting such formats would be a great addition to the service, as there are no open-source PDF converters available.
Thanks !
yan-hic commented
Closing as for each point:
- decided to move/rename to the source_uri + .json extension
- cannot be forced into one. The service creates a shard (json in gcs) for every 10 pages
- the metadata returns enough details and status code
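
For anyone landing here later, a rough sketch of the two points above: how many shards to expect for a document given the 10-pages-per-shard behaviour, and how a renamed output name could be derived from the source uri. The function names and the shard-index suffix used for multi-shard documents are my own assumptions for illustration, not something the service defines, and the sketch only builds names (it does not call GCS).

```python
import math
from urllib.parse import urlparse

# From the comment above: the service writes one JSON shard per 10 pages.
PAGES_PER_SHARD = 10


def expected_shard_count(page_count: int, pages_per_shard: int = PAGES_PER_SHARD) -> int:
    """Return how many JSON shards to expect for a document of page_count pages."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return math.ceil(page_count / pages_per_shard)


def renamed_output_uri(source_uri: str, shard_index: int, shard_count: int) -> str:
    """Build a destination name of the form source_uri + '.json'.

    Assumption (hypothetical convention): when a document has several
    shards, the zero-based shard index is inserted before the extension
    so the names do not collide.
    """
    if urlparse(source_uri).scheme != "gs":
        raise ValueError("expected a gs:// uri")
    if shard_count <= 1:
        return f"{source_uri}.json"
    return f"{source_uri}.{shard_index}.json"


if __name__ == "__main__":
    source = "gs://my-bucket/input/report.pdf"  # hypothetical input uri
    count = expected_shard_count(23)
    for index in range(count):
        print(renamed_output_uri(source, index, count))
```

The actual move/rename would then be done with whatever GCS client you already use; keeping the name derivation in a pure function makes it easy to check without touching a bucket.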