This repo contains sample code for using GPT-4o for multimodal use cases. The samples are also integrated with LangChain (https://www.langchain.com/), a popular framework for building Gen AI applications.

GPT-4o ("o" for "omni") is designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats

Currently, the API supports {text, image} inputs only, with {text} outputs, the same modalities as gpt-4-turbo. Additional modalities, including audio, will be introduced soon.
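
For reference, a {text, image} request to gpt-4o looks roughly like the sketch below, using the official OpenAI Python SDK. The prompt and image URL are placeholders; the client reads OPENAI_API_KEY from the environment.

```python
# Minimal {text, image} -> {text} call to gpt-4o with the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.png"}},  # placeholder URL
            ],
        }
    ],
)
print(response.choices[0].message.content)
```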

Getting started:

  • Clone this repo to your local machine / VM: git clone https://github.com/cwijayasundara/gpt-4o-multimodality.git
  • Make sure you have Python 3.11 or above installed on your machine / VM
  • Create a .env file with the key and value below, substituting your own OpenAI key (a sketch of how a script can load it follows this list):
  • OPENAI_API_KEY='sk-********'
  • More information about getting an OpenAI key and setting up a .env file can be found at https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key
  • Install the dependencies required for the project by executing pip install -r requirements.txt from the repo root
  • Video processing needs two Python packages, opencv-python and moviepy; both are listed in requirements.txt (a frame-sampling sketch follows this list)
  • These packages require ffmpeg, so make sure to install it beforehand. Depending on your OS, you may need to run brew install ffmpeg (macOS) or sudo apt install ffmpeg (Debian/Ubuntu)
  • Once the dev setup is done, cd into the langchain directory and run python gpt_4o_research_images.py to execute the image processor (see the LangChain sketch after this list)
  • Enjoy!!
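
How the scripts pick up the key depends on the repo's code; here is a minimal sketch, assuming they use python-dotenv to read the .env file:

```python
# Minimal sketch: load OPENAI_API_KEY from a .env file with python-dotenv.
# Assumes .env sits in the current working directory.
import os

from dotenv import load_dotenv

load_dotenv()  # parses .env and copies its entries into the process environment
api_key = os.environ["OPENAI_API_KEY"]  # raises KeyError if the key is missing
```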
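
Because the API currently accepts only {text, image} inputs, the usual approach to video with GPT-4o is to sample frames with opencv-python and send them as base64-encoded images. Below is a hedged sketch of that pattern; the file name and sampling interval are illustrative, not taken from this repo:

```python
# Sample every n-th frame of a video as a base64-encoded JPEG, so the frames
# can be sent to gpt-4o as image inputs. opencv-python is the only dependency.
import base64

import cv2

def sample_frames(path: str, every_n: int = 30) -> list[str]:
    """Return every n-th frame of the video at `path` as a base64 JPEG string."""
    video = cv2.VideoCapture(path)
    frames: list[str] = []
    index = 0
    while True:
        ok, frame = video.read()
        if not ok:  # end of stream or read error
            break
        if index % every_n == 0:
            encoded, buffer = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append(base64.b64encode(buffer).decode("utf-8"))
        index += 1
    video.release()
    return frames

frames = sample_frames("sample_video.mp4")  # hypothetical input file
print(f"Sampled {len(frames)} frames")
```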
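
On the LangChain side, gpt_4o_research_images.py presumably builds a multimodal message along these lines. This is a sketch assuming a recent langchain-openai package (exact import paths vary by LangChain version), with a placeholder image URL:

```python
# Send a {text, image} message to gpt-4o through LangChain's ChatOpenAI wrapper.
# Requires langchain-openai and OPENAI_API_KEY in the environment.
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
message = HumanMessage(
    content=[
        {"type": "text", "text": "What is shown in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
    ]
)
print(llm.invoke([message]).content)  # prints the model's text reply
```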