Multimodal AI Essentials

Welcome to the "Multimodal AI Essentials" code repository! In this repo, we will learn how multimodal AI merges text, image, and audio for smarter models.

Much of the code in these sessions will be featured in the 2nd edition of my latest book on LLMs:

A Quick Start Guide to LLMs

so if you're itching for more, check it out and please leave a rating/review to tell me what you thought :)

For even more, check out my Expert Playlist!

Prerequisites

  • Intermediate to Advanced Python Skills: Comfort with Python is crucial, as we'll use it throughout the course to interact with Hugging Face tools and integrate NLP into practical examples.

  • Foundational Machine Learning Knowledge: You should have an understanding of core machine learning principles, as we'll build upon these concepts when exploring advanced NLP techniques.

Installation

  1. Clone this repository to your local machine.
  2. Ensure you have set the following API keys:
  • OpenAI key
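
As a minimal sketch for supplying the key inside a notebook — assuming the notebooks read it from the `OPENAI_API_KEY` environment variable (check each notebook's first cell; adjust if a notebook loads the key differently):

```python
# Minimal sketch: supply your OpenAI key without hard-coding it.
# Assumption: the notebooks read it from the OPENAI_API_KEY environment
# variable -- adjust if a notebook loads the key some other way.
import os
from getpass import getpass

# Prompt for the key at runtime so it never gets committed to source control
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")
```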

You're all set to explore the notebooks!

Usage - Jupyter Notebooks

This project contains several Jupyter notebooks, each focusing on a specific topic:

  1. Intro to Multimodality: An introduction to multimodality with CLIP and SHAP-E (see the CLIP sketch after this list)

    • Whisper: An introduction to using Whisper for audio transcription (see the Whisper sketch after this list)

    • Llava: Using an open-source, multi-turn multimodal engine

  2. Visual Q/A

  3. A Sample Twilio App for Voice Messaging with AI

  4. (Time Permitting) Multimodal Agents
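
As a quick taste of the CLIP material above, here is a minimal zero-shot image-text matching sketch using Hugging Face `transformers`. The `openai/clip-vit-base-patch32` checkpoint and the example image URL are illustrative choices, not necessarily what the notebook itself uses:

```python
# Minimal CLIP zero-shot image/text matching sketch.
# Requires: pip install transformers torch pillow requests
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any RGB image works; this URL is just an illustrative placeholder
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

captions = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax -> probabilities
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```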
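
And for the Whisper material, a minimal transcription sketch using the open-source `openai-whisper` package. The model size and audio file name are placeholders; the notebook may use a different interface (e.g., the OpenAI API):

```python
# Minimal audio transcription sketch with openai-whisper.
# Requires: pip install openai-whisper  (plus ffmpeg installed on your system)
import whisper

model = whisper.load_model("base")           # "base" is a small, fast model size
result = model.transcribe("your_audio.mp3")  # path to any audio file you have
print(result["text"])                        # the transcribed text
```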

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Book time with me on Intro!

If you have questions, I'm available on Intro :)

Book time with me on Intro