We use BLIP2 as the multimodal pre-training method. BLIP2 is one of the state-of-the-art models for multimodal pre-training and outperforms most existing methods on Visual Question Answering, Image Captioning, and Image-Text Retrieval. For the LLM, we use Llama 2, the next-generation open-source large language model, which outperforms existing open-source language models on many benchmarks, including reasoning, coding, proficiency, and knowledge tests.
We use the dataset from Content Moderation with AWS AI Services to test how well BLIP2 can detect unsafe content in images and, at the same time, provide an explanation when given effective prompts.
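As a rough illustration (not the exact setup in this repository), the sketch below shows how BLIP2 could be prompted in a visual-question-answering style to flag unsafe image content and explain its verdict, using the Hugging Face transformers implementation; the checkpoint name, prompt wording, and image path are placeholder assumptions.

```python
# Hypothetical sketch: prompting BLIP2 (via Hugging Face transformers) to flag
# unsafe image content and explain its answer. The checkpoint, prompt, and
# image path are illustrative placeholders, not this repo's exact configuration.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "Salesforce/blip2-opt-2.7b"  # assumed public checkpoint

processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

image = Image.open("sample.jpg").convert("RGB")  # placeholder image path

# VQA-style prompt asking for a safety verdict plus a short explanation.
prompt = (
    "Question: Does this image contain unsafe or inappropriate content? "
    "Explain why. Answer:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=60)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)
```

The same pattern can be repeated with different prompt phrasings to compare which prompts yield the most reliable moderation verdicts and explanations.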
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.