
🛋️ Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities

Paper | Project Page | Hugging Face Dataset

This repository provides the code and instructions for the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol for systematically assessing the spatial reasoning capabilities of vision-language models (VLMs). Follow the steps below to set up the environment, generate the data (optional), and run the experiments. Feel free to open an issue if you encounter any problems; pull requests are also welcome.

Table of Contents

  1. Setup Environment
  2. Prepare Data
  3. Add API Credentials
  4. Run Experiments
  5. Run Evaluations
  6. Evaluate More Models
  7. Common Problems and Solutions

Setup Environment

Clone the repository and create a conda environment using the provided environment.yml file:

git clone https://github.com/sled-group/COMFORT.git
cd COMFORT
conda env create -f environment.yml

After creating the environment:

conda activate comfort

Then, install the editable packages (paths are relative to the repository root):

cd models/GLAMM
pip install -e .
cd ../llava
pip install -e .
cd ../InternVL/internvl_chat
pip install -e .
cd ../../..

You can also use Poetry to set up the environment; a minimal sketch is shown below.
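
A minimal Poetry workflow, assuming the repository provides a pyproject.toml (check the repo root before relying on this), would be:

poetry install
poetry run ./run_english_ball_experiments.sh  # any later command can be wrapped in `poetry run`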

Prepare Data

First, create a data directory:

mkdir data

(Option 1.) Download the data from Hugging Face

wget https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_ball.zip?download=true -O data/comfort_ball.zip
unzip data/comfort_ball.zip -d data/
wget https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_car_ref_facing_left.zip?download=true -O data/comfort_car_ref_facing_left.zip
unzip data/comfort_car_ref_facing_left.zip -d data/
wget https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_car_ref_facing_right.zip?download=true -O data/comfort_car_ref_facing_right.zip
unzip data/comfort_car_ref_facing_right.zip -d data/
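
Optionally verify the extracted layout (the directory names below are assumed from the archive names above):

ls data/
# expected (assumed): comfort_ball  comfort_car_ref_facing_left  comfort_car_ref_facing_right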

(Option 2.) Data generation

pip install gdown
python download_assets.py
chmod +x generate_dataset.sh
./generate_dataset.sh

Add API Credentials

Create the credentials file:

touch comfort_utils/model_utils/api_keys.py

  1. Prepare OpenAI and DeepL API keys and add the following to api_keys.py:

    APIKEY_OPENAI = "<YOUR_API_KEY>"
    APIKEY_DEEPL = "<YOUR_API_KEY>"

  2. Prepare Google Cloud Translate API credentials (a .json service-account key file); its path is exported as GOOGLE_APPLICATION_CREDENTIALS when running the multilingual experiments below.
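
For illustration only, the snippet below shows one common way such credentials are consumed by the OpenAI and DeepL Python SDKs; the repository's own loading code may differ.

    # Illustrative sketch -- not necessarily how comfort_utils uses the keys.
    from comfort_utils.model_utils.api_keys import APIKEY_OPENAI, APIKEY_DEEPL
    from openai import OpenAI
    import deepl

    openai_client = OpenAI(api_key=APIKEY_OPENAI)      # OpenAI client built from the key above
    deepl_translator = deepl.Translator(APIKEY_DEEPL)  # DeepL translator built from the key above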

Run Experiments

English:

./run_english_ball_experiments.sh
./run_english_car_left_experiments.sh
./run_english_car_right_experiments.sh

Multilingual (requires the Google Cloud Translate credentials prepared above):

export GOOGLE_APPLICATION_CREDENTIALS="your_google_application_credentials_path.json"
./run_multilingual_ball_experiments.sh
./run_multilingual_car_left_experiments.sh
./run_multilingual_car_right_experiments.sh

Run Evaluations

English

  1. Preferred Coordinate Transformation (Table 2 & Table 7):
    python gather_results.py --mode cpp --cpp convention
  2. Preferred Frame of Reference (Table 3 & Table 8):
    python gather_results.py --mode cpp --cpp preferredfor
  3. Perspective Taking (Table 4 & Table 9):
    python gather_results.py --mode cpp --cpp perspective
  4. Comprehensive Evaluation (Table 5):
    python gather_results.py --mode comprehensive

Multilingual (Figure 8 & Table 10)

python gather_results_multilingual.py

After evaluation completes:

cd results/eval
python eval_multilingual_preferredfor_raw.py

Evaluate More Models

To evaluate additional models, refer to the Model Wrapper code and implement the same interface as the existing wrappers; a hedged sketch of the general shape is shown below.
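
The class name, method names, and signatures below are assumptions for illustration only, not the repository's actual API; treat the existing wrappers as the authoritative reference.

    # Hypothetical wrapper sketch -- names and signatures are illustrative only.
    from PIL import Image

    class MyVlmWrapper:
        """Minimal shape of a VLM wrapper: load a model, answer one query."""

        def __init__(self, model_name: str, device: str = "cuda"):
            self.model_name = model_name
            self.device = device
            # Load the underlying checkpoint/processor here.

        def generate(self, image: Image.Image, prompt: str) -> str:
            # Run the model on one image/prompt pair and return its text answer.
            raise NotImplementedError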

Common Problems and Solutions

  1. ImportError: libcupti.so.11.7: cannot open shared object file: No such file or directory
    Reinstall the CUDA 11.8 builds of PyTorch:
    pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
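
A quick optional sanity check that the CUDA-enabled build is the one being imported:

    python -c "import torch; print(torch.__version__, torch.version.cuda)"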