CLIP-Finder

CLIP-Finder enables semantic, offline search of gallery photos using natural language descriptions or the camera. Built on Apple's MobileCLIP-S0 architecture, it delivers fast and accurate on-device media retrieval.

[Demo video: CLIP-Finder.mp4]

CLIP-Finder is an iOS application that leverages advanced AI models to perform image similarity searches. It uses two CoreML models optimized for the Apple Neural Engine for efficient on-device processing. The app lets users search their photo gallery using natural language descriptions or live camera input. All searches run completely offline, and the app provides a user-friendly interface for searching and profiling while taking full advantage of Apple's on-device AI capabilities.

This project is based on Apple's MobileCLIP architecture; details can be found in the MobileCLIP paper. The selected subarchitecture is MobileCLIP-S0, whose image and text encoder latencies are consistent with those reported by the authors. The image encoder is based on FastViT with some minor modifications. The general architecture of the two approaches implemented in CLIP-Finder is shown below:

[Architecture diagrams: Text Architecture | Video Architecture]

Features

  • Text-based image search
  • Image-based search using iPhone's front or rear camera
  • Two CoreML models optimized for the Apple Neural Engine:
    • CLIP Image Model
    • CLIP Text Model
  • GPU-accelerated image preprocessing using MPSGraph
  • Similarity calculation using dot product in MPSGraph (see the sketch after this list)
  • Model profiling for performance analysis across different compute units
  • Cache management for optimized performance
  • Tokenizer: Swift implementation based on the Python tokenizer from open_clip
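
To make the dot-product step concrete, here is a minimal sketch, not CLIP-Finder's actual code, of ranking gallery embeddings against a query embedding with MPSGraph. The `rankByDotProduct` function and its data layout are illustrative assumptions, and the embeddings are assumed to be L2-normalized so the dot product equals cosine similarity.

```swift
import Metal
import MetalPerformanceShadersGraph

// Hypothetical sketch (not the app's actual code): rank gallery embeddings
// against a query embedding by dot product on the GPU with MPSGraph.
// Assumes all embeddings are L2-normalized, so dot product == cosine similarity.
func rankByDotProduct(query: [Float], gallery: [[Float]]) -> [Int] {
    guard let mtlDevice = MTLCreateSystemDefaultDevice(), !gallery.isEmpty else { return [] }
    let dim = query.count
    let count = gallery.count

    let graph = MPSGraph()
    // [1, dim] query x [dim, count] gallery matrix -> [1, count] similarity scores
    let q = graph.placeholder(shape: [1, NSNumber(value: dim)],
                              dataType: .float32, name: "query")
    let g = graph.placeholder(shape: [NSNumber(value: dim), NSNumber(value: count)],
                              dataType: .float32, name: "gallery")
    let scores = graph.matrixMultiplication(primary: q, secondary: g, name: "scores")

    // Pack the query and the gallery matrix (one embedding per column) into Data.
    var galleryMatrix = [Float](repeating: 0, count: dim * count)
    for (j, embedding) in gallery.enumerated() {
        for i in 0..<dim { galleryMatrix[i * count + j] = embedding[i] }
    }
    let device = MPSGraphDevice(mtlDevice: mtlDevice)
    let feeds: [MPSGraphTensor: MPSGraphTensorData] = [
        q: MPSGraphTensorData(device: device,
                              data: query.withUnsafeBufferPointer { Data(buffer: $0) },
                              shape: [1, NSNumber(value: dim)], dataType: .float32),
        g: MPSGraphTensorData(device: device,
                              data: galleryMatrix.withUnsafeBufferPointer { Data(buffer: $0) },
                              shape: [NSNumber(value: dim), NSNumber(value: count)],
                              dataType: .float32)
    ]

    // Run synchronously and read the [1, count] score row back to the CPU.
    let results = graph.run(feeds: feeds, targetTensors: [scores], targetOperations: nil)
    var out = [Float](repeating: 0, count: count)
    results[scores]!.mpsndarray().readBytes(&out, strideBytes: nil)

    // Indices of gallery photos, sorted by descending similarity.
    return (0..<count).sorted { out[$0] > out[$1] }
}
```

In practice the gallery matrix would be packed once and reused across queries instead of being rebuilt on every call.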

Components

  1. AI Models:

    • CLIP Image Model (CoreML, optimized for Apple Neural Engine)
    • CLIP Text Model (CoreML, optimized for Apple Neural Engine)
  2. Image Processing:

    • Preprocessing: Utilizes MPSGraph for efficient GPU-based image preparation
    • Postprocessing: Employs MPSGraph for similarity calculations using dot product, and selects the photos with the highest similarity scores
  3. User Interface:

    • Main view for search operations
    • Settings view for cache management and model profiling
  4. Data Management:

    • Core Data for efficient storage and retrieval of processed image features
  5. Asynchronous Image Prediction (Turbo Mode)

    CLIP-Finder2 now includes an experimental asynchronous image prediction feature, known as "Turbo Mode", which can be activated through a button in the camera interface (a minimal sketch of the idea follows this list).

    • Faster image processing: Turbo Mode enables asynchronous camera prediction, potentially speeding up the image search process.
    • Activation: To activate, tap the "Turbo" button in the lower right corner of the camera interface.
    • For more information on asynchronous prediction in Core ML, refer to this WWDC 2023 session: Improve Core ML integration with async prediction

    ⚠️ WARNING: Turbo Mode is faster but may cause the app to freeze momentarily. Use with caution.
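
For context, below is a minimal sketch of the idea behind asynchronous prediction, assuming Core ML's async prediction API from iOS 17. The `TurboPredictor` type, the "image" feature name, and the frame-dropping policy are placeholders, not CLIP-Finder's actual implementation.

```swift
import CoreML
import CoreVideo

// Hypothetical sketch of the Turbo Mode idea (not the app's actual code):
// submit camera frames with Core ML's async prediction API (iOS 17+) and
// drop frames while a prediction is still in flight. The "image" input name
// and the TurboPredictor type are illustrative placeholders.
@MainActor
final class TurboPredictor {
    private let model: MLModel
    private var inFlight = false

    init(model: MLModel) { self.model = model }

    func process(_ pixelBuffer: CVPixelBuffer) {
        guard !inFlight else { return }   // skip this frame; one is already running
        inFlight = true
        Task {
            defer { inFlight = false }
            do {
                let input = try MLDictionaryFeatureProvider(
                    dictionary: ["image": MLFeatureValue(pixelBuffer: pixelBuffer)])
                // Async prediction does not block the capture thread and lets
                // Core ML overlap work across successive frames.
                let output = try await model.prediction(from: input,
                                                        options: MLPredictionOptions())
                handle(output)
            } catch {
                print("Turbo prediction failed: \(error)")
            }
        }
    }

    private func handle(_ output: MLFeatureProvider) {
        // Read the image embedding here and trigger the similarity search.
    }
}
```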

Core ML Packages

This section describes the variations of the Core ML packages available. These packages are designed to provide different levels of performance and accuracy, suitable for a variety of applications.

The Core ML packages are available at: 🤗 MobileCLIP on HuggingFace.

Core ML Conversion Scripts

This section provides details on the CoreML conversion scripts used for converting models to the CoreML format. The scripts are available as Jupyter Notebooks and can be found in the repository.

Scripts

  1. CLIPImageModel to CoreML (Open in Colab)

    • This notebook demonstrates the process of converting a CLIP image model to CoreML format.
  2. CLIPTextModel to CoreML (Open in Colab)

    • This notebook demonstrates the process of converting a CLIP text model to CoreML format.

Search Methods

  1. Text Search: Enter descriptive text to find matching images in your gallery
  2. Image Search: Use either the front or rear camera of your iPhone to capture an image and find similar photos

System Requirements

  • iOS 17 or later
  • iPhone with A12 Bionic chip or later (for optimal AI model performance on the Neural Engine)

Important Notes

  • It is recommended to turn off Low Power Mode for optimal performance, especially to fully utilize the Apple Neural Engine.
  • The app requires permission to access your photo gallery and camera.

Batch Processing

CLIP-Finder implements efficient batch processing to handle large photo galleries:

  • On app launch, the entire photo gallery is preprocessed using the Neural Engine with a batch size of 512 photos. This approach significantly reduces the initial processing time (see the sketch after this list).

  • When new photos are added to the device's gallery, CLIP-Finder detects and processes only the new images upon the next app launch. This incremental processing ensures that the app stays up-to-date with the latest additions to your photo library without redundant calculations.

  • Similarly, if photos are deleted from the device, CLIP-Finder updates its database accordingly during the next app launch. This cleanup process maintains the accuracy of the search results and optimizes storage usage.
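
The sketch below shows one way such an incremental sync could be structured with the Photos framework. The closure parameters stand in for CLIP-Finder's Core Data persistence and image encoding, and are assumptions rather than the app's real API.

```swift
import Photos

// Hypothetical sketch of the incremental gallery sync (not the app's actual
// code). `knownIdentifiers`, `removeDeleted`, and `processBatch` stand in for
// the app's Core Data persistence and CLIP image encoding.
func syncGallery(knownIdentifiers: Set<String>,
                 batchSize: Int = 512,
                 removeDeleted: (Set<String>) -> Void,
                 processBatch: ([String]) -> Void) {
    // Requires photo library authorization before calling.
    let fetchResult = PHAsset.fetchAssets(with: .image, options: nil)
    var current: [String] = []
    fetchResult.enumerateObjects { asset, _, _ in
        current.append(asset.localIdentifier)
    }

    // Photos removed from the device since the last launch.
    removeDeleted(knownIdentifiers.subtracting(current))

    // Photos added since the last launch, encoded in fixed-size batches so the
    // Neural Engine processes many images per pass.
    let added = current.filter { !knownIdentifiers.contains($0) }
    for start in stride(from: 0, to: added.count, by: batchSize) {
        processBatch(Array(added[start..<min(start + batchSize, added.count)]))
    }
}
```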

Refer to this Blog for more details: 🤗 Hugging Face Blog

Settings

The Settings view provides two main functions:

  1. Clear Cache: Removes all preprocessed image data to free up storage space
  2. Model Profiler: Runs a performance analysis of both CoreML models across the different compute units (CPU, GPU, Neural Engine, and combinations), letting you see the performance benefit of the Apple Neural Engine (see the sketch below)
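
As an illustration of what the profiler measures, here is a minimal sketch, assuming plain Core ML APIs, that loads the same compiled model under each `MLComputeUnits` setting and times repeated predictions; the function name and parameters are hypothetical.

```swift
import Foundation
import CoreML

// Hypothetical sketch of a compute-unit profiler (not the app's actual code):
// load the same compiled model with each MLComputeUnits setting and time
// repeated predictions. `modelURL` and `input` are supplied by the caller.
func profileModel(at modelURL: URL, input: MLFeatureProvider, runs: Int = 20) throws {
    let units: [(String, MLComputeUnits)] = [
        ("CPU only", .cpuOnly),
        ("CPU + GPU", .cpuAndGPU),
        ("CPU + Neural Engine", .cpuAndNeuralEngine),
        ("All", .all)
    ]
    for (label, unit) in units {
        let configuration = MLModelConfiguration()
        configuration.computeUnits = unit
        let model = try MLModel(contentsOf: modelURL, configuration: configuration)

        _ = try model.prediction(from: input)          // warm-up run (loads the plan)
        let start = CFAbsoluteTimeGetCurrent()
        for _ in 0..<runs { _ = try model.prediction(from: input) }
        let msPerRun = (CFAbsoluteTimeGetCurrent() - start) / Double(runs) * 1000
        print("\(label): \(String(format: "%.2f", msPerRun)) ms / prediction")
    }
}
```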

This is the view of the built-in Core ML profiler.

[Screenshot: Model Profiler view]

Acknowledgments

This project is based on the MobileCLIP architecture, which is licensed under the MIT License. Acknowledgment is extended to the authors of the paper, Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, and Oncel Tuzel, for their valuable contributions.

License

This project is licensed under the MIT License. See the LICENSE file for more details.