This repository explores how robots can perceive and understand their environment by combining image understanding with natural language processing. It covers vision-language models for robotics applications, run with the Intel OpenVINO Toolkit.
This repository is presented as a workshop at the ROS meetup Lagos.
- Ubuntu 22.04 or newer
- ROS 2 Humble or newer
- Python 3
- Intel OpenVINO toolkit
To run the code in this repository, you need a device compatible with the Intel OpenVINO Toolkit. With the OpenVINO 2024.1 or newer releases installed here, this typically means an Intel CPU, GPU, or NPU; Intel Neural Compute Sticks are only supported by older OpenVINO releases.
mkdir -p ~/ros2_ws/src
cd ~/ros2_ws
virtualenv -p python3 ./vlm-venv
source ./vlm-venv/bin/activate
# Make sure that colcon doesn’t try to build the venv
touch ./vlm-venv/COLCON_IGNORE
pip install timm --extra-index-url https://download.pytorch.org/whl/cpu # timm pulls in torch; the extra index provides CPU-only wheels
pip install "openvino>=2024.1" "torch>=2.1" opencv-python supervision transformers yapf pycocotools addict "gradio>=4.19" tqdm
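With the packages installed, it can help to confirm which inference devices OpenVINO actually sees on the machine before running the nodes. The following is a minimal sketch, not part of the repository: `list_openvino_devices` is a hypothetical helper name, and it returns an empty list when OpenVINO cannot be imported (for example when the virtual environment is not active).

```python
def list_openvino_devices():
    """Return the device names OpenVINO reports, or [] if OpenVINO is not importable."""
    try:
        import openvino as ov
    except ImportError:
        # Most likely the virtual environment is not activated.
        return []
    # Core.available_devices lists names such as "CPU" or "GPU".
    return list(ov.Core().available_devices)


if __name__ == "__main__":
    devices = list_openvino_devices()
    if devices:
        print("OpenVINO devices:", devices)
    else:
        print("OpenVINO is not importable or reports no devices")
```

Any name printed here should be usable as the device parameter of the nodes described later.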
Replace <<YOUR_USER_NAME>> with your system username, and adjust python3.10 if your virtual environment uses a different Python version.
export PYTHONPATH='/home/<<YOUR_USER_NAME>>/ros2_ws/vlm-venv/lib/python3.10/site-packages'
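The export above hard-codes both the username and the Python version. As a rough alternative, the path can be derived from the interpreter inside the activated virtual environment. This is a sketch under the assumption of the standard POSIX virtualenv layout (`lib/pythonX.Y/site-packages`); `venv_site_packages` is a hypothetical helper, not something the repository provides.

```python
import sys
from pathlib import Path


def venv_site_packages(venv_root):
    """Build the site-packages path of a virtualenv for the running interpreter."""
    version = f"python{sys.version_info.major}.{sys.version_info.minor}"
    # Assumes the standard POSIX layout: <venv>/lib/pythonX.Y/site-packages
    return Path(venv_root).expanduser() / "lib" / version / "site-packages"


if __name__ == "__main__":
    # The venv location matches the one created in the steps above.
    print(venv_site_packages("~/ros2_ws/vlm-venv"))
```

Run with the virtual environment's python3, the printed path is the value to assign to PYTHONPATH.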
cd ~/ros2_ws/src
git clone https://github.com/nilutpolkashyap/vlms_with_ros2_workshop.git
cd ~/ros2_ws/src/vlms_with_ros2_workshop
python3 download_weights.py
Download the zip file from the Google Drive link here
Place the contents of the zip file inside the 'openvino_irs' directory at the following path:
~/ros2_ws/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/openvino_irs
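After copying the files, a quick sanity check can catch an incomplete extraction. OpenVINO IR models are stored as an .xml file with a matching .bin file, so the sketch below lists every .xml that has no .bin next to it. The exact contents of the archive are not described here, so treat this as an assumption: `find_incomplete_irs` is a hypothetical helper, and any unrelated .xml file in the directory would show up as a false positive.

```python
from pathlib import Path


def find_incomplete_irs(ir_dir):
    """Return .xml files under ir_dir that have no .bin file with the same stem."""
    ir_dir = Path(ir_dir).expanduser()
    return sorted(
        xml for xml in ir_dir.rglob("*.xml")
        if not xml.with_suffix(".bin").exists()
    )


if __name__ == "__main__":
    target = "~/ros2_ws/src/vlms_with_ros2_workshop/ros2_vlm/ros2_vlm/modules/openvino_irs"
    missing = find_incomplete_irs(target)
    if missing:
        for xml in missing:
            print("No matching .bin for", xml)
    else:
        print("No .xml file is missing its .bin (or no .xml files were found)")
```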
cd ~/ros2_ws
colcon build --symlink-install
source ~/ros2_ws/install/setup.bash
GroundedSAM tackles object detection and segmentation. It integrates several open-world models, so a robot can not only detect objects but also identify the specific regions they occupy. This lets robots act on specific parts of an object (e.g., grasping a cup's handle) based on textual instructions or visual cues.
ros2 run ros2_vlm grounded_sam --ros-args -p device:='CPU' -p video_source:=/dev/video2 -p isSegment:=False -p detectionList:='["eyes", "person", "hair"]'
- device - Inference Device (e.g. CPU, GPU, NPU)
- video_source - Video source to read image frames from (e.g. /dev/video2)
- isSegment - Whether to run the Segment Anything model (True/False)
- detectionList - List of objects to detect
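ROS 2 parses each value after `-p name:=` as YAML, which is why a list such as detectionList needs careful shell quoting. One way to sidestep the quoting is to assemble the command from Python values. The sketch below is illustrative only: `build_ros2_run_command` is a hypothetical helper, it relies on JSON output being valid YAML for these simple types, and it assumes the node declares isSegment as a boolean and detectionList as a list of strings.

```python
import json
import shlex


def build_ros2_run_command(package, executable, params):
    """Build an argument list for `ros2 run` with parameters encoded as YAML values."""
    cmd = ["ros2", "run", package, executable, "--ros-args"]
    for name, value in params.items():
        # json.dumps yields YAML-compatible text for strings, booleans and lists.
        cmd += ["-p", f"{name}:={json.dumps(value)}"]
    return cmd


if __name__ == "__main__":
    cmd = build_ros2_run_command(
        "ros2_vlm",
        "grounded_sam",
        {
            "device": "CPU",
            "video_source": "/dev/video2",
            "isSegment": False,
            "detectionList": ["eyes", "person", "hair"],
        },
    )
    # shlex.join shows a shell-safe version of the command for copy and paste.
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run` in a shell where the workspace has been sourced, which avoids shell quoting altogether.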
Check out more in the GroundedSAM OpenVINO Notebook
BLIP bridges the gap between vision and language. It analyzes images and extracts meaningful information, generating captions describing the scene or answering questions about it. This lets robots not only "see" their environment but also understand its context and respond to natural language instructions effectively.
ros2 run ros2_vlm blip_visual_qna --ros-args -p device_name:="GPU.0" -p question:="What is in the image?" -p image_path:="/home/nilutpol/ai_ws/src/blip_qna_code/demo2.jpg"
- device_name - Inference Device (e.g. CPU, GPU, NPU)
- question - Question for the BLIP model
- image_path - Path to the image source
Check out more in the BLIP Visual Question Answering OpenVINO Notebook