googleapis/google-cloud-php

batchAnnotateFiles failing silently (and taking php thread with it)

James-THEA opened this issue · 2 comments

Environment details

  • OS: Amazon Linux 2023
  • PHP version: 8.2.15
  • Package name and version: v1.9.0

Steps to reproduce

  1. Use this file:
    faraone2005 (1).pdf

  2. Request pages 1-10
    a. That is two batches of 5 pages each. It works if I request only pages 1-9.

More context:
I have a setup to parse PDFs that relies on the Google Cloud Vision API. It has worked for the past several months, and anecdotally this is a new issue. There is no error thrown, and the PHP thread just dies.

Moreover, the issue doesn't exist in all my environments. Locally, everything works great (PHP 8.2.4), and it also works on an Amazon Beanstalk server (same versions as listed above). However, the issue exists on both new and old servers that we have spun up. That means one possible fix is finding the discrepancy between the working and failing servers and correcting it; however, I still think this should be filed as a bug.

I have added memory-usage logging, and nothing appears that crazy (>100MB). Memory does spike on the first request using batchAnnotateFiles, and the process dies on the second request, so it is possible it spikes again; I strongly suspect a memory limit is the problem.
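To confirm or rule out the memory-limit theory, one option is to log current and peak usage (plus the configured limit) immediately before and after each Vision request. This is only a diagnostic sketch; the helper name and label are arbitrary:

```php
// Diagnostic sketch: log current/peak memory and the configured limit.
// Call it before and after each batchAnnotateFiles() request.
function logMemory(string $label): void
{
    error_log(sprintf(
        '%s: current=%.1f MB, peak=%.1f MB, limit=%s',
        $label,
        memory_get_usage(true) / 1048576,      // bytes allocated from the system
        memory_get_peak_usage(true) / 1048576, // high-water mark for this process
        ini_get('memory_limit')
    ));
}
```

One caveat: if the worker is killed at the OS level (e.g. the kernel OOM killer, or memory held by the grpc C extension, which PHP's counters don't track), nothing reaches the PHP error log, which would match the silent death described above.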

I found this bug report: https://www.googlecloudcommunity.com/gc/AI-ML/Vision-AI-OCR-Internal-server-error-Failed-to-process-features/m-p/735441

It looks almost identical to my issue, but it is for Vision AI, so the fix is not applicable.

Code example

A little edited for brevity, but I can confirm it still has the problem.

// Assumes the V1 client surface shipped with the package version above:
// use Google\Cloud\Vision\V1\AnnotateFileRequest;
// use Google\Cloud\Vision\V1\Feature;
// use Google\Cloud\Vision\V1\Feature\Type;
// use Google\Cloud\Vision\V1\ImageAnnotatorClient;
// use Google\Cloud\Vision\V1\InputConfig;
private function myFunction($filePath, int $startingPage, int $lastPage): FileUploadResponse {
        $pdfContent = \Storage::get($filePath);
        $inputConfig = (new InputConfig())
            ->setMimeType('application/pdf')
            ->setContent($pdfContent);
        $feature = (new Feature())->setType(Type::DOCUMENT_TEXT_DETECTION);

        $totalPages = range($startingPage + 1, $lastPage + 1);
        $pageChunks = array_chunk($totalPages, 5);
        $overallText = '';
        $maxLength = self::MAX_UPLOAD_TEXT_LENGTH;        
        
        for ($chunk = 0; $chunk < count($pageChunks); $chunk++) {
            try {
                $imageAnnotator = new ImageAnnotatorClient(['credentials' => 'redacted']);
                $pages = $pageChunks[$chunk];
                $annotateFileRequest = (new AnnotateFileRequest())
                    ->setInputConfig($inputConfig)
                    ->setFeatures([$feature])
                    ->setPages($pages);
                try {
                    $response = $imageAnnotator->batchAnnotateFiles([$annotateFileRequest]); // request dies here
                } catch (\Exception $e) {
                    Logger($e->getMessage()); // json_encode() on an exception yields "{}", so log the message
                    continue; // without this, $response below would be undefined after a caught failure
                }
                $responses = $response->getResponses()[0]->getResponses();

                for ($x = 0; $x < min(count($pages), count($responses)); $x++) {
                    $pageResponse = $responses[$x];
                    if ($pageResponse->hasError()) {
                        continue;
                    }
                    if ($pageResponse->getFullTextAnnotation() !== null) {
                        $overallText .= $pageResponse->getFullTextAnnotation()->getText();
                    }
                }
            } finally {
                $imageAnnotator->close();
                gc_collect_cycles();
            }
        }
        return new FileUploadResponse(text: $overallText);
    }

Adding some follow up investigation:

  • If we decrease the batch size to 1-4 pages, it works
  • If we don't chunk by pages and make one request, it works.
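Given those observations, the simplest mitigation sketch is to lower the chunk size in the function above from 5 to 4, leaving everything else unchanged:

```php
// Workaround based on the observations above: batches of up to 4 pages
// succeed where batches of 5 fail, so chunk more conservatively.
$totalPages = range($startingPage + 1, $lastPage + 1);
$pageChunks = array_chunk($totalPages, 4); // was 5
```

This is a stopgap rather than a fix; it just keeps each batchAnnotateFiles() call under the size that triggers the failure.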

Hello @James-THEA! Thanks for the message.

This is an interesting issue, I find it odd that the thread just dies. What are the new and old servers you mentioned? Physical servers that you have?

Are you still having the same issue or has there been any change recently?