Multimodal (and multi-lingual) search demo

This app demonstrates Weaviate's multimodal capabilities.

It uses a CLIP model to encode images and text into the same vector space.


To run this example, you need:

  • Docker (to run Weaviate)
  • Python 3.8 or higher

Setup instructions

  1. Install the Python dependencies.
    pip install -r requirements.txt
  2. Run Docker Compose to spin up a Weaviate instance and the CLIP inference container.
    docker compose up -d
  3. Create the collection definition, then import the data along with some pre-prepared queries.
  4. Start the Streamlit app.
    streamlit run

Usage instructions

Input a search query into the text box, or upload an image.


This returns the top 6 results from the Weaviate instance, sorted by the cosine similarity between the query vector and each object's vector representation.
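
To make the ranking concrete, here is a minimal sketch of a top-k search by cosine similarity, written in plain Python. It is not the app's code: Weaviate performs the search inside the database, whereas this sketch simply compares the query against every object. The function names (`cosine_similarity`, `top_k`), the file names, and the three-dimensional toy vectors are all hypothetical stand-ins for real CLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, objects, k=6):
    """Return up to k (name, score) pairs, most similar to the query first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in objects.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real CLIP vectors have far more dimensions.
toy_objects = {
    "forest.jpg": [0.9, 0.1, 0.0],
    "universe.jpg": [0.0, 0.2, 0.9],
    "beach.jpg": [0.5, 0.5, 0.1],
}

# Stands in for an encoded text or image query.
toy_query = [1.0, 0.0, 0.0]

for name, score in top_k(toy_query, toy_objects, k=2):
    print(f"{name}: {score:.3f}")
```

Because text and images are encoded into the same vector space, the same ranking step applies whether the query vector came from a typed description or an uploaded image.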

Example search results for an image query: mm_demo_by_img

Example search results for a text query: mm_demo_by_text

Note: The model used is multi-lingual, so it can understand queries in multiple languages. Try a search with an image, and then try entering a description of that image in different languages.

Dataset license

Universe image from Unsplash

Forest image from Unsplash