NanoOWL

👍 Usage - ⏱️ Performance - 🛠️ Setup - 🤸 Examples - 👏 Acknowledgement - 🔗 See also

NanoOWL is a project that optimizes OWL-ViT to run 🔥 real-time 🔥 on NVIDIA Jetson Orin Platforms with NVIDIA TensorRT. NanoOWL also introduces a new "tree detection" pipeline that combines OWL-ViT and CLIP to enable nested detection and classification of anything, at any level, simply by providing text.

Interested in detecting object masks as well? Try combining NanoOWL with NanoSAM for zero-shot open-vocabulary instance segmentation.

👍 Usage

You can use NanoOWL in Python like this:

import PIL.Image

from nanoowl.owl_predictor import OwlPredictor

predictor = OwlPredictor(
    "google/owlvit-base-patch32",
    image_encoder_engine="data/owl_image_encoder_patch32.engine"
)

image = PIL.Image.open("assets/owl_glove_small.jpg")

output = predictor.predict(image=image, text=["an owl", "a glove"], threshold=0.1)

print(output)

Or better yet, to use OWL-ViT in conjunction with CLIP to detect and classify anything, at any level, check out the tree predictor example below!

See Setup for instructions on how to build the image encoder engine.
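
To make the role of `threshold` concrete, the sketch below filters a list of candidate detections by score and pairs each survivor with its text prompt. The `Detection` tuple and the `filter_detections` helper are hypothetical and exist only for illustration; they are not part of NanoOWL, and the actual structure of `output` may differ.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    # Hypothetical container for illustration, not a NanoOWL type.
    label_index: int                         # index into the list of text prompts
    score: float                             # confidence between 0 and 1
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in pixels


def filter_detections(
    detections: List[Detection], text: List[str], threshold: float
) -> List[Tuple[str, Detection]]:
    """Keep detections scoring above `threshold`, paired with their prompt.

    Using a strict comparison is an assumption made for this sketch.
    """
    return [(text[d.label_index], d) for d in detections if d.score > threshold]


candidates = [
    Detection(0, 0.42, (10, 20, 200, 240)),
    Detection(1, 0.05, (220, 30, 300, 120)),   # below the threshold, dropped
    Detection(1, 0.17, (230, 40, 310, 130)),
]

kept = filter_detections(candidates, ["an owl", "a glove"], threshold=0.1)
for label, det in kept:
    print(f"{label}: score={det.score:.2f} box={det.box}")
```

Raising the threshold trades recall for precision: fewer boxes survive, but the ones that do carry higher scores.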

⏱️ Performance

NanoOWL runs real-time on Jetson Orin Nano.

| Model † | Image Size | Patch Size | ⏱️ Jetson Orin Nano (FPS) | ⏱️ Jetson AGX Orin (FPS) | 🎯 Accuracy (mAP) |
|---|---|---|---|---|---|
| OWL-ViT (ViT-B/32) | 768 | 32 | TBD | 95 | 28 |
| OWL-ViT (ViT-B/16) | 768 | 16 | TBD | 25 | 31.7 |
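
Frame rates are easier to compare against a latency budget when converted to time per frame. The short calculation below does that for the Jetson AGX Orin figures in the table; the helper name `frame_latency_ms` is only illustrative.

```python
# FPS figures copied from the Jetson AGX Orin column of the table above.
fps_agx_orin = {
    "OWL-ViT (ViT-B/32)": 95,
    "OWL-ViT (ViT-B/16)": 25,
}


def frame_latency_ms(fps: float) -> float:
    """Convert frames per second into milliseconds spent per frame."""
    return 1000.0 / fps


for model, fps in fps_agx_orin.items():
    print(f"{model}: {frame_latency_ms(fps):.1f} ms per frame")
```

This works out to roughly 10.5 ms per frame for ViT-B/32 and 40 ms per frame for ViT-B/16.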

🛠️ Setup

Install the dependencies
1. Install PyTorch
2. Install torch2trt
3. Install NVIDIA TensorRT
4. Install the Transformers library
```
python3 -m pip install transformers
```
5. (optional) Install NanoSAM (for the instance segmentation example)
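
Before moving on, it can help to confirm that the dependencies listed above are importable. The following is a minimal sketch using only the standard library; it assumes the packages expose the top-level module names `torch`, `torch2trt`, `tensorrt`, `transformers`, and `nanosam`, and it only checks that each module can be found, not that its version is compatible.

```python
import importlib.util
from typing import List

# Module names are assumptions based on the packages listed above.
REQUIRED = ["torch", "torch2trt", "tensorrt", "transformers"]
OPTIONAL = ["nanosam"]


def missing_modules(names: List[str]) -> List[str]:
    """Return the module names that cannot be located on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing_required = missing_modules(REQUIRED)
missing_optional = missing_modules(OPTIONAL)

if missing_required:
    print("Missing required modules:", ", ".join(missing_required))
else:
    print("All required modules were found.")

if missing_optional:
    print("Missing optional modules:", ", ".join(missing_optional))
```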

Install the NanoOWL package.

git clone https://github.com/NVIDIA-AI-IOT/nanoowl
cd nanoowl
python3 setup.py develop --user

Build the TensorRT engine for the OWL-ViT vision encoder

mkdir -p data
python3 -m nanoowl.build_image_encoder_engine \
    data/owl_image_encoder_patch32.engine

Run an example prediction to ensure everything is working

cd examples
python3 owl_predict.py \
    --prompt="[an owl, a glove]" \
    --threshold=0.1 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

That's it! If everything is working properly, you should see a visualization saved to data/owl_predict_out.jpg.
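
If the example fails, one thing worth ruling out is an engine path that does not match the file built in the previous step. The sketch below checks for the file before anything else is loaded. The `require_engine` helper is hypothetical and not part of NanoOWL; the path is the one used in the build command above, relative to the repository root.

```python
from pathlib import Path


def require_engine(path: str) -> Path:
    """Return the engine path if it points to a non-empty file, else raise."""
    engine = Path(path)
    if not engine.is_file():
        raise FileNotFoundError(
            f"{engine} not found; build it with "
            "python3 -m nanoowl.build_image_encoder_engine first"
        )
    if engine.stat().st_size == 0:
        raise ValueError(f"{engine} is empty; the build may have been interrupted")
    return engine


try:
    engine_path = require_engine("data/owl_image_encoder_patch32.engine")
    print(f"Found engine: {engine_path}")
except (FileNotFoundError, ValueError) as err:
    print(f"Engine check failed: {err}")
```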

🤸 Examples

Example 1 - Basic prediction

This example demonstrates how to use the TensorRT optimized OWL-ViT model to detect objects by providing text descriptions of the object labels.

To run the example, first navigate to the examples folder

cd examples

Then run the example

python3 owl_predict.py \
    --prompt="[an owl, a glove]" \
    --threshold=0.1 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

By default the output will be saved to data/owl_predict_out.jpg.

You can also use this example to profile inference. Simply set the flag --profile.
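
For readers who want to time their own code, here is a generic sketch of a throughput measurement: a few warm-up calls followed by a timed loop. It is not how `--profile` is implemented, and `measure_fps` and `fake_inference` are illustrative names; the sleep stands in for a real prediction call.

```python
import time
from typing import Callable


def measure_fps(fn: Callable[[], object], warmup: int = 3, iterations: int = 20) -> float:
    """Call `fn` repeatedly and return the average number of calls per second."""
    # Warm-up calls absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def fake_inference() -> None:
    # Placeholder workload; replace with a real prediction call.
    time.sleep(0.005)


print(f"{measure_fps(fake_inference):.1f} FPS")
```

When timing GPU work with PyTorch, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels are launched asynchronously and the loop could otherwise finish before the work does.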

Example 2 - Tree prediction

This example demonstrates how to use the tree predictor class to detect and classify objects at any level.

To run the example, first navigate to the examples folder

cd examples

To detect all owls, and then detect all wings and eyes within each detected owl region of interest, type

python3 tree_predict.py \
    --prompt="[an owl [a wing, an eye]]" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

By default the output will be saved to data/tree_predict_out.jpg.

To classify the image as indoors or outdoors, type

python3 tree_predict.py \
    --prompt="(indoors, outdoors)" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine

To classify the image as indoors or outdoors, and if it's outdoors then detect all owls, type

python3 tree_predict.py \
    --prompt="(indoors, outdoors [an owl])" \
    --threshold=0.15 \
    --image_encoder_engine=../data/owl_image_encoder_patch32.engine
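
The prompts above follow a small bracket grammar: square brackets list things to detect, parentheses list classes to choose between, and a group written after a label applies inside that label's region. The sketch below is an independent reading of that grammar inferred from the examples, not NanoOWL's own parser; the `Label`, `Group`, `parse_prompt`, and `describe` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Label:
    text: str
    children: List["Group"] = field(default_factory=list)


@dataclass
class Group:
    op: str  # "detect" for [...], "classify" for (...)
    labels: List[Label] = field(default_factory=list)


# Opening bracket -> (operation, matching closing bracket)
_OPEN = {"[": ("detect", "]"), "(": ("classify", ")")}


def _parse_group(s: str, i: int) -> Tuple[Group, int]:
    """Parse the group starting at s[i]; return it and the index after its close."""
    op, close = _OPEN[s[i]]
    group = Group(op)
    i += 1
    while True:
        # Read label text up to the next structural character.
        start = i
        while i < len(s) and s[i] not in ",[]()":
            i += 1
        label = Label(s[start:i].strip())
        # Any groups that follow directly are nested under this label.
        while i < len(s) and s[i] in _OPEN:
            child, i = _parse_group(s, i)
            label.children.append(child)
            while i < len(s) and s[i].isspace():
                i += 1
        if i >= len(s):
            raise ValueError("unbalanced brackets in prompt")
        if label.text:
            group.labels.append(label)
        elif label.children:
            raise ValueError("nested group without a parent label")
        if s[i] == ",":
            i += 1
        elif s[i] == close:
            return group, i + 1
        else:
            raise ValueError(f"unexpected {s[i]!r} at position {i}")


def parse_prompt(prompt: str) -> Group:
    s = prompt.strip()
    if not s or s[0] not in _OPEN:
        raise ValueError("prompt must start with '[' or '('")
    group, end = _parse_group(s, 0)
    if s[end:].strip():
        raise ValueError("unexpected text after closing bracket")
    return group


def describe(group: Group, depth: int = 0) -> List[str]:
    """Flatten the tree into indented 'op: label' lines."""
    lines = []
    for label in group.labels:
        lines.append("  " * depth + f"{group.op}: {label.text}")
        for child in label.children:
            lines.extend(describe(child, depth + 1))
    return lines


print("\n".join(describe(parse_prompt("(indoors, outdoors [an owl])"))))
```

Read this way, the last prompt classifies the whole image first and attaches the owl detection only to the `outdoors` branch.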

Example 3 - Tree prediction (Live Camera)

This example demonstrates the tree predictor running on a live camera feed with live-edited text prompts. To run the example:

1. Ensure you have a camera device connected.
2. Launch the demo:

cd examples/tree_demo
python3 tree_demo.py ../../data/owl_image_encoder_patch32.engine

3. Open your browser to http://<ip address>:7860
4. Type whatever prompt you like to see what works! Here are some examples:
- Example: [a face [a nose, an eye, a mouth]]
- Example: [a face (interested, yawning / bored)]
- Example: (indoors, outdoors)

👏 Acknowledgement

Thanks to the authors of OWL-ViT for the great open-vocabulary detection work.

🔗 See also

NanoSAM - A real-time Segment Anything (SAM) model variant for NVIDIA Jetson Orin platforms.
Jetson Introduction to Knowledge Distillation Tutorial - For an introduction to knowledge distillation as a model optimization technique.
Jetson Generative AI Playground - For instructions and tips for using a variety of LLMs and transformers on Jetson.
Jetson Containers - For a variety of easily deployable and modular Jetson Containers.
