This repo contains sample code for using GPT-4o for multimodal use cases. I have also integrated it with LangChain (https://www.langchain.com/), a popular framework for building Gen AI applications.
GPT-4o ("o" for "omni") is designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats.
Currently, the API supports {text, image} inputs only, with {text} outputs, the same modalities as gpt-4-turbo. Additional modalities, including audio, will be introduced soon.
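As a quick illustration, here is a minimal sketch of a {text, image} request through LangChain's ChatOpenAI wrapper. It assumes the langchain-openai package is installed and OPENAI_API_KEY is set in your environment; the image URL is a hypothetical placeholder:

```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

# Assumes OPENAI_API_KEY is set in the environment (see the .env setup below).
llm = ChatOpenAI(model="gpt-4o")

# GPT-4o currently accepts {text, image} inputs and returns {text} outputs.
message = HumanMessage(
    content=[
        {"type": "text", "text": "Describe what is shown in this image."},
        # Hypothetical placeholder URL -- swap in your own image.
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
    ]
)

response = llm.invoke([message])
print(response.content)
```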
Getting started:
- Clone this repo to your local machine / VM: git clone https://github.com/cwijayasundara/gpt-4o-multimodality.git
- Make sure you have Python 3.11 or above installed on your machine / VM
- Create a .env file with the key and value below, using your own OpenAI key (a minimal sketch of loading it in Python is shown after this list):
- OPENAI_API_KEY='sk-********'
- More information about getting an OpenAI key and setting up a .env file can be found here: https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key
- Install the dependencies required for the project by executing pip install -r requirements.txt from the root
- You will need two Python packages for video processing, opencv-python and moviepy; both are listed in requirements.txt (a frame-extraction sketch is shown after this list)
- These require ffmpeg, so make sure to install it beforehand. Depending on your OS, you may need to run brew install ffmpeg or sudo apt install ffmpeg
- Once the dev setup is done, cd into the langchain directory and run python gpt_4o_research_images.py to execute the image processor
- Enjoy!!
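For reference, here is a minimal sketch of loading the key from the .env file with python-dotenv, assuming the file sits in the directory you run from (add python-dotenv to requirements.txt if it is not already there):

```python
import os
from dotenv import load_dotenv

# Reads the .env file in the current working directory and
# exports its key/value pairs into the process environment.
load_dotenv()

# Fail fast with a clear message if the key was not picked up.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not found -- check your .env file")
```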
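And here is a rough sketch of the kind of frame extraction opencv-python enables for the video use cases. The file name and sampling interval are hypothetical; the idea is to sample frames and base64-encode them so they can be sent to GPT-4o as images:

```python
import base64
import cv2  # opencv-python

def sample_frames(video_path: str, every_n: int = 50) -> list[str]:
    """Grab every n-th frame and return them as base64-encoded JPEGs."""
    video = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if index % every_n == 0:
            # Encode the raw frame as JPEG, then base64 for the API payload.
            _, buffer = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buffer).decode("utf-8"))
        index += 1
    video.release()
    return frames

# Hypothetical file name -- point this at your own clip.
frames = sample_frames("sample_video.mp4", every_n=50)
print(f"Extracted {len(frames)} frames")
```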