A project that optimizes OWL-ViT for real-time inference with NVIDIA TensorRT.

Primary language: Python. License: Apache-2.0.

NanoOWL

👍 Usage - ⏱️ Performance - 🛠️ Setup - 🤸 Examples - 👏 Acknowledgement - 🔗 See also

NanoOWL is a project that optimizes OWL-ViT to run 🔥 real-time 🔥 on NVIDIA Jetson Orin Platforms with NVIDIA TensorRT. NanoOWL also introduces a new "tree detection" pipeline that combines OWL-ViT and CLIP to enable nested detection and classification of anything, at any level, simply by providing text.

Interested in detecting object masks as well? Try combining NanoOWL with NanoSAM for zero-shot open-vocabulary instance segmentation.

👍 Usage

You can use NanoOWL in Python like this

import PIL.Image

from nanoowl.owl_predictor import OwlPredictor

# Load OWL-ViT with the TensorRT image encoder engine built in Setup
predictor = OwlPredictor(
    "google/owlvit-base-patch32",
    image_encoder_engine="data/owl_image_encoder_patch32.engine"
)

image = PIL.Image.open("assets/owl_glove_small.jpg")

# Detect objects matching each text prompt, using a score threshold of 0.1
output = predictor.predict(image=image, text=["an owl", "a glove"], threshold=0.1)

print(output)

Or better yet, to use OWL-ViT in conjunction with CLIP to detect and classify anything, at any level, check out the tree predictor example below!

See Setup for instructions on how to build the image encoder engine.
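
The `threshold` argument decides which candidate detections are kept. The sketch below illustrates that idea on plain Python data so it runs without a GPU. The `Detection` tuple, the `filter_by_threshold` helper, and the sample scores are hypothetical and not part of the NanoOWL API, whose actual output structure and comparison rule may differ.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical container for one detection (not a NanoOWL type)."""
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def filter_by_threshold(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep detections scoring at least `threshold`, highest score first."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Made-up candidates, purely for illustration.
candidates = [
    Detection("an owl", 0.42, (10, 20, 200, 240)),
    Detection("a glove", 0.08, (150, 180, 260, 300)),
    Detection("a glove", 0.17, (30, 200, 120, 310)),
]

for det in filter_by_threshold(candidates, threshold=0.1):
    print(f"{det.label}: {det.score:.2f} at {det.box}")
```

Raising the threshold trades recall for precision: fewer boxes survive, but the ones that do carry higher scores.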

⏱️ Performance

NanoOWL runs real-time on Jetson Orin Nano.

| Model † | Image Size | Patch Size | ⏱️ Jetson Orin Nano (FPS) | ⏱️ Jetson AGX Orin (FPS) | 🎯 Accuracy (mAP) |
|---|---|---|---|---|---|
| OWL-ViT (ViT-B/32) | 768 | 32 | TBD | 95 | 28 |
| OWL-ViT (ViT-B/16) | 768 | 16 | TBD | 25 | 31.7 |
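
A frame rate converts to an average per-frame time budget of 1000 / FPS milliseconds. The snippet below applies that arithmetic to the Jetson AGX Orin figures listed above. It is a convenience calculation rather than a measurement, and `fps_to_latency_ms` is a name chosen here for illustration.

```python
def fps_to_latency_ms(fps: float) -> float:
    """Convert frames per second to average milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Jetson AGX Orin throughput values taken from the performance table.
agx_orin_fps = {
    "OWL-ViT (ViT-B/32)": 95,
    "OWL-ViT (ViT-B/16)": 25,
}

for model, fps in agx_orin_fps.items():
    print(f"{model}: {fps} FPS is about {fps_to_latency_ms(fps):.1f} ms per frame")
```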

🛠️ Setup

  1. Install the dependencies

    1. Install PyTorch

    2. Install torch2trt

    3. Install NVIDIA TensorRT

    4. Install the Transformers library

      python3 -m pip install transformers

    5. (optional) Install NanoSAM (for the instance segmentation example)

  2. Install the NanoOWL package.

    git clone https://github.com/NVIDIA-AI-IOT/nanoowl
    cd nanoowl
    python3 setup.py develop --user

  3. Build the TensorRT engine for the OWL-ViT vision encoder

    mkdir -p data
    python3 -m nanoowl.build_image_encoder_engine \
        data/owl_image_encoder_patch32.engine

  4. Run an example prediction to ensure everything is working

    cd examples
    python3 owl_predict.py \
        --prompt="[an owl, a glove]" \
        --threshold=0.1 \
        --image_encoder_engine=../data/owl_image_encoder_patch32.engine

That's it! If everything is working properly, you should see a visualization saved to data/owl_predict_out.jpg.
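
If the example fails at import time, a quick check like the following can show which dependency from step 1 is missing. This is a minimal sketch: the import names in `REQUIRED` are assumed to match the packages listed above, and `missing_modules` is a helper written for this illustration.

```python
import importlib.util
from typing import Iterable, List

# Import names assumed for the packages listed in the setup steps.
REQUIRED = ["torch", "torch2trt", "tensorrt", "transformers", "nanoowl"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

The check only confirms that each module can be located. It does not verify versions or that TensorRT can see a GPU.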

🤸 Examples

Example 1 - Basic prediction

This example demonstrates how to use the TensorRT optimized OWL-ViT model to detect objects by providing text descriptions of the object labels.

To run the example, first navigate to the examples folder

cd examples

Then run the example

python3 owl_predict.py \
    --prompt="[an owl, a glove]" \
    --threshold=0.1 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

By default the output will be saved to data/owl_predict_out.jpg.

You can also use this example to profile inference. Simply set the flag --profile.
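
For a rough idea of how throughput can be measured around any callable, here is a minimal timing sketch. It is independent of the `--profile` flag, whose implementation is not described here. `measure_fps` and the stand-in workload are hypothetical. When timing GPU work, asynchronous execution would also need to be synchronized before reading the clock.

```python
import time
from typing import Callable


def measure_fps(fn: Callable[[], object], warmup: int = 3, iterations: int = 20) -> float:
    """Call `fn` repeatedly and return the average number of calls per second."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def fake_inference() -> int:
    """Stand-in workload; replace with a real prediction call."""
    return sum(i * i for i in range(10_000))


print(f"about {measure_fps(fake_inference):.1f} calls per second")
```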

Example 2 - Tree prediction

This example demonstrates how to use the tree predictor class to detect and classify objects at any level.

To run the example, first navigate to the examples folder

cd examples

To detect all owls, and then detect all wings and eyes in each detected owl region of interest, type

python3 tree_predict.py \
    --prompt="[an owl [a wing, an eye]]" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

By default the output will be saved to data/tree_predict_out.jpg.

To classify the image as indoors or outdoors, type

python3 tree_predict.py \
    --prompt="(indoors, outdoors)" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

To classify the image as indoors or outdoors, and if it's outdoors then detect all owls, type

python3 tree_predict.py \
    --prompt="(indoors, outdoors [an owl])" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine
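
The prompts above suggest a small grammar: square brackets list things to detect, parentheses list classes to choose between, and a group written after a label applies inside that label's regions. The sketch below parses that notation into a nested dictionary to make the structure explicit. It is an illustration inferred from the examples, not NanoOWL's own parser, and `parse_prompt` and `describe` are hypothetical names.

```python
from typing import Any, Dict, List, Tuple

# Opening character -> (operation, matching closing character)
OPENERS = {"[": ("detect", "]"), "(": ("classify", ")")}


def parse_prompt(prompt: str) -> Dict[str, Any]:
    """Parse a nested prompt string into a tree of dicts."""
    text = prompt.strip()
    group, pos = _parse_group(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return group


def _parse_group(text: str, pos: int) -> Tuple[Dict[str, Any], int]:
    if pos >= len(text) or text[pos] not in OPENERS:
        raise ValueError(f"expected '[' or '(' at position {pos}")
    op, closer = OPENERS[text[pos]]
    pos += 1
    items: List[Dict[str, Any]] = []
    while True:
        # Read the label up to the next structural character.
        start = pos
        while pos < len(text) and text[pos] not in "[](),":
            pos += 1
        label = text[start:pos].strip()
        if not label:
            raise ValueError(f"empty label at position {start}")
        # Groups that follow a label are nested under that label.
        children: List[Dict[str, Any]] = []
        while pos < len(text) and text[pos] in OPENERS:
            child, pos = _parse_group(text, pos)
            children.append(child)
            while pos < len(text) and text[pos] == " ":
                pos += 1
        items.append({"label": label, "children": children})
        if pos < len(text) and text[pos] == ",":
            pos += 1
            continue
        if pos < len(text) and text[pos] == closer:
            return {"op": op, "items": items}, pos + 1
        raise ValueError(f"expected ',' or '{closer}' at position {pos}")


def describe(group: Dict[str, Any], indent: int = 0) -> List[str]:
    """Render the tree as indented 'operation: label' lines."""
    lines: List[str] = []
    for item in group["items"]:
        lines.append(" " * indent + f'{group["op"]}: {item["label"]}')
        for child in item["children"]:
            lines.extend(describe(child, indent + 2))
    return lines


print("\n".join(describe(parse_prompt("(indoors, outdoors [an owl])"))))
```

Unbalanced or mismatched brackets raise `ValueError`, which makes the sketch usable as a quick sanity check on a prompt before passing it to `--prompt`.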

Example 3 - Tree prediction (Live Camera)

This example demonstrates the tree predictor running on a live camera feed with live-edited text prompts. To run the example

  1. Ensure you have a camera device connected

  2. Launch the demo

    cd examples/tree_demo
    python3 tree_demo.py ../../data/owl_image_encoder_patch32.engine

  3. Open your browser to http://<ip address>:7860

  4. Type whatever prompt you like to see what works! Here are some examples

    • Example: [a face [a nose, an eye, a mouth]]
    • Example: [a face (interested, yawning / bored)]
    • Example: (indoors, outdoors)

👏 Acknowledgement

Thanks to the authors of OWL-ViT for the great open-vocabulary detection work.

🔗 See also