[Feature Request] Support for Text Recognition in Images for Vision Models

Question

Closed this issue 6 months ago · 6 comments

Hello,

I would like to propose an enhancement for models with vision capabilities, focused on Optical Character Recognition (OCR). It would let models identify and analyze text in uploaded images, so they can process and understand visual data more effectively.

Proposed Feature:
I suggest introducing a feature that allows users to upload images containing text, which the model then processes to recognize and understand that text. This would significantly increase the model's usefulness in applications that involve document analysis or reading informational signage, road signs, and other visual forms of communication.

Potential Benefits:
Enables models to process and interpret text from images in real time.
Increases the accuracy and capabilities of the model in scenarios related to document automation and analysis.
Enhances accessibility and interactivity for users who could use the model to translate text from images into various languages.

Use Case Example:
In the context of office automation, this feature could automatically analyze and process forms, invoices, and other business documents submitted as images, significantly streamlining business processes.

Answer 1 · 2024-05-01T07:07:27.000Z

Great idea!

Can you draft your preferred API for that? It does not have to be perfect or final - I'd like to figure out what direction we might want to take with this.

Answer 2 · 2024-05-02T14:52:30.000Z

I like this idea too. My work is bringing me toward Anthropic's Claude, and also Amazon Bedrock, which offers access to Anthropic and other models.

Answer 3 · 2024-05-03T13:09:24.000Z

The way I see it, your $instructor->respond/request would take an array of Image objects as an additional parameter, since there may be a need to include several images. For example, an Image object could look like this:

class Image
{
    private string $mediaType;
    private string $dataBase64;
    private string $name;

    private function __construct(string $mediaType, string $dataBase64, string $name)
    {
        $this->mediaType = $mediaType;
        $this->dataBase64 = $dataBase64;
        $this->name = $name;
    }

    public static function createFromUrl(string $url, string $name): self
    {
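        $imageContent = file_get_contents($url);
        if ($imageContent === false) {
            throw new \InvalidArgumentException('Could not read image from URL');
        }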
        $finfo = new \finfo(FILEINFO_MIME_TYPE);
        $mimeType = $finfo->buffer($imageContent);

        if (!str_starts_with($mimeType, 'image/')) {
            throw new \InvalidArgumentException('URL does not point to an image');
        }

        $dataBase64 = base64_encode($imageContent);

        return new self($mimeType, $dataBase64, $name);
    }

    public static function createFromBase64(string $dataBase64, string $name): self
    {
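        // Strict mode returns false if the input contains invalid base64 characters.
        $imageContent = base64_decode($dataBase64, true);
        if ($imageContent === false) {
            throw new \InvalidArgumentException('Provided data is not valid base64');
        }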
        $finfo = new \finfo(FILEINFO_MIME_TYPE);
        $mediaType = $finfo->buffer($imageContent);

        if (!str_starts_with($mediaType, 'image/')) {
            throw new \InvalidArgumentException('Provided data does not represent an image');
        }

        return new self($mediaType, $dataBase64, $name);
    }

    public static function createFromPath(string $path, string $name): self
    {
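        // is_file() also rejects directories, which file_exists() would accept.
        if (!is_file($path) || !is_readable($path)) {
            throw new \InvalidArgumentException('File does not exist or is not readable');
        }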

        $imageContent = file_get_contents($path);
        $finfo = new \finfo(FILEINFO_MIME_TYPE);
        $mediaType = $finfo->buffer($imageContent);

        if (!str_starts_with($mediaType, 'image/')) {
            throw new \InvalidArgumentException('File is not an image');
        }

        $dataBase64 = base64_encode($imageContent);

        return new self($mediaType, $dataBase64, $name);
    }

    public function toArray(): array
    {
        return [
            'mediaType' => $this->mediaType,
            'dataBase64' => $this->dataBase64,
            'name' => $this->name,
        ];
    }

    public function getMediaType(): string
    {
        return $this->mediaType;
    }

    public function getDataBase64(): string
    {
        return $this->dataBase64;
    }

    public function getName(): string
    {
        return $this->name;
    }
}

The Anthropic message payload could then be built like this (see https://docs.anthropic.com/claude/docs/vision):

    public function getCompletionsMessagesToArray(array $messages): array
    {
        $messagesArray = [];
        foreach ($messages as $message) {
            if (!empty($message->getImages())) {
                $contentArray = [];
                /** @var Image $image */
                foreach ($message->getImages() as $image) {
                    $contentArray[] = [
                        'type' => 'text',
                        'text' => 'Image name: '. $image->getName(),
                    ];
                    $contentArray[] = [
                        'type' => 'image',
                        'source' => [
                            'type' => 'base64',
                            'media_type' => $image->getMediaType(),
                            'data' => $image->getDataBase64(),
                        ],
                    ];
                }
                $contentArray[] = [
                    'type' => 'text',
                    'text' => $message->getContent(),
                ];
                $messagesArray[] = [
                    'role' => $message->getRole()->value,
                    'content' => $contentArray,
                ];
            } else {
                $messagesArray[] = [
                    'role' => $message->getRole()->value,
                    'content' => $message->getContent()
                ];
            }
        }

        return $messagesArray;
    }
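
For reference, here is a minimal sketch of the shape a single user message with one image would take under that approach. The image bytes, image name, and prompt text are placeholders I made up for illustration; only the structure (text and image content blocks with a base64 source) follows the Anthropic vision docs linked above.

```php
<?php

declare(strict_types=1);

// Placeholder bytes standing in for a real image file (assumption for illustration).
$fakeImageBytes = 'not-a-real-image';

// One user message: a caption block, the image block, then the actual prompt.
$message = [
    'role' => 'user',
    'content' => [
        ['type' => 'text', 'text' => 'Image name: receipt'],
        [
            'type' => 'image',
            'source' => [
                'type' => 'base64',
                'media_type' => 'image/png',
                'data' => base64_encode($fakeImageBytes),
            ],
        ],
        ['type' => 'text', 'text' => 'Extract the text from this image.'],
    ],
];

// Print the payload as JSON to inspect the structure.
echo json_encode($message, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES), PHP_EOL;
```

Messages without images can keep a plain string as their content, as in the else branch above.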

Answer 4 · 2024-05-15T07:20:24.000Z

Thanks! I've been refactoring the LLM client over the last 2 weeks - it's still in progress, so let me take a look at how something like what you described could be integrated into the new code.

Answer 5 · 2024-05-19T18:29:08.000Z

Check this example:
https://cognesy.github.io/instructor-php/hub/techniques/image_to_data/

I think it demonstrates that what you're trying to achieve is already possible with Instructor.

In this example, I'm using a receipt image from the web:
https://www.inogic.com/blog/wp-content/uploads/2020/09/Receipt-Processor-AI-Builder-in-Canvas-App-9.png

For the demo, I defined the following receipt data model:

class Vendor {
    public ?string $name = '';
    public ?string $address = '';
    public ?string $phone = '';
}

class ReceiptItem {
    public string $name;
    public ?int $quantity = 1;
    public float $price;
}

class Receipt {
    public Vendor $vendor;
    /** @var ReceiptItem[] */
    public array $items = [];
    public ?float $subtotal;
    public ?float $tax;
    public ?float $tip;
    public float $total;
}

See the example code for how to get the image data and build the appropriate structure for the messages parameter:
https://github.com/cognesy/instructor-php/tree/main/examples/03_Techniques/ImageToData

As a result, I get the following data:

^ Receipt^ {#1361
  +vendor: Vendor^ {#1370
    +name: "Geo Restaurant"
    +address: "300 72th Street Miami Beach Fl 33141"
    +phone: "305-864-5586"
  }
  +items: array:7 [
    0 => ReceiptItem^ {#1394
      +name: "Ferrari Carano"
      +quantity: 1
      +price: 47.0
    }
    1 => ReceiptItem^ {#1396
      +name: "Insalate Cesare"
      +quantity: 1
      +price: 7.5
    }
    2 => ReceiptItem^ {#1389
      +name: "Caprese with prosciutto"
      +quantity: 1
      +price: 9.5
    }
    3 => ReceiptItem^ {#1395
      +name: "FISH SPEC"
      +quantity: 1
      +price: 25.95
    }
    4 => ReceiptItem^ {#1401
      +name: "Spinach Ricotta Ravioli"
      +quantity: 1
      +price: 15.95
    }
    5 => ReceiptItem^ {#1402
      +name: "Seafood Pasta"
      +quantity: 1
      +price: 19.95
    }
    6 => ReceiptItem^ {#1403
      +name: "Ossobuco"
      +quantity: 1
      +price: 29.95
    }
  ]
  +subtotal: 155.8
  +tax: 14.02
  +tip: 5.0
  +total: 169.82
}

The next step would be to define validations, e.g. sum up the item prices and make sure they add up to the subtotal.
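
A minimal sketch of such a check, in plain PHP rather than any Instructor validation API. The function name itemsMatchSubtotal and the tolerance value are hypothetical, and it assumes price is a per-unit price that should be multiplied by quantity:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Instructor): checks that the line items
 * add up to the extracted subtotal, within a small tolerance for float rounding.
 * Assumes each item has public `price` and `quantity` properties and that
 * `price` is the per-unit price.
 */
function itemsMatchSubtotal(array $items, ?float $subtotal, float $tolerance = 0.01): bool
{
    if ($subtotal === null) {
        // No subtotal was extracted, so there is nothing to compare against.
        return true;
    }

    $sum = 0.0;
    foreach ($items as $item) {
        // A missing (null) quantity is treated as 1.
        $sum += $item->price * ($item->quantity ?? 1);
    }

    return abs($sum - $subtotal) <= $tolerance;
}

// Usage with the first two items from the receipt above.
$items = [
    (object) ['name' => 'Ferrari Carano', 'quantity' => 1, 'price' => 47.0],
    (object) ['name' => 'Insalate Cesare', 'quantity' => 1, 'price' => 7.5],
];

var_dump(itemsMatchSubtotal($items, 54.5)); // bool(true)
var_dump(itemsMatchSubtotal($items, 60.0)); // bool(false)
```

A similar comparison could check the total against the other amounts; in the output above, subtotal plus tax equals the total.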

What do you think?

Answer 6 · 2024-05-19T19:22:41.000Z

Looks good, thanks!