This repo contains the code for the following paper:
Hongxin Zhang*, Weihua Du*, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, Chuang Gan: Building Cooperative Embodied Agents Modularly with Large Language Models
Paper: arXiv
Project Website: Co-LLM-Agents
[9/4/2023]: ThreeDWorld Multi-Agent Transport no longer provides ground truth segmentation masks by default. We implement a vision detection module with a fine-tuned Mask R-CNN model. For more details, please read the README in tdw_mat.
[8/1/2023]: We provide the VirtualHome Simulator executable we used here. If you previously encountered the "XDG_RUNTIME_DIR not set in the environment" error, please check that you are using the new version we provided.
For detailed instructions on installing the two embodied multi-agent environments, Communicative Watch-And-Help and ThreeDWorld Multi-Agent Transport, please refer to the Setup sections in cwah/README.md and tdw_mat/README.md, respectively.
Run the following commands step by step to set up the TDW-MAT environment:
cd tdw_mat
conda create -n tdw_mat python=3.9
conda activate tdw_mat
pip install -e .
If you're running TDW on a remote Linux server, follow the TDW Installation Document to configure the X server.
After that, you can run the demo scene to verify your setup:
python demo/demo_scene.py
Step 1: Get the VirtualHome Simulator and API
Clone the VirtualHome API repository:
git clone --branch wah https://github.com/xavierpuigf/virtualhome.git
Download the Simulator (Linux x86-64 version) and unzip it.
The files should be organized as follows:
|--cwah/
|--virtualhome/
|--executable/
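If you want to double-check the layout before running anything, a small helper like the one below can do it. This is a hypothetical convenience script, not part of the repo: missing_dirs and EXPECTED_DIRS are names made up for this sketch, and it assumes you run it from the folder that should contain the three directories.

```python
from pathlib import Path

# Directory names taken from the layout above. The root folder is assumed to be
# the one you run this from (hypothetical helper, not shipped with the repo).
EXPECTED_DIRS = ("cwah", "virtualhome", "executable")


def missing_dirs(root):
    """Return the expected sibling directories that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Layout looks correct.")
```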
Step 2: Install Requirements
cd cwah
conda create --name cwah python=3.8
conda activate cwah
pip install -r requirements.txt
The main implementation code of our CoELA is in tdw_mat/LLM and tdw_mat/tdw_gym/lm_agent.py.
We also provide example scripts for running experiments with the HP baseline and our CoELA under the folder tdw_mat/scripts.
For example, to run experiments with two CoELA agents on ThreeDWorld Multi-Agent Transport, run the following command in the folder tdw_mat:
./scripts/test_LMs-gpt-4.sh
We extend the ThreeDWorld Transport Challenge into a multi-agent setting, named ThreeDWorld Multi-Agent Transport (TDW-MAT), built on top of the TDW platform. TDW-MAT adds more types of objects and containers, more realistic object placements, and support for communication between agents.
The agents are tasked with transporting as many target objects as possible to the goal position, using containers as tools. One container can carry at most three objects; without a container, an agent can transport only two objects at a time. The agents have the same egocentric visual observations and action space as before, with a new communication action added.
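The capacity rules above make containers worth fetching once there are enough objects. The sketch below estimates how many trips to the goal position one agent needs. It is a hypothetical helper, not part of TDW-MAT, and it deliberately simplifies: with a container, only the container's three slots count per trip, and the agent's other hand and the cost of finding a container are ignored.

```python
import math

# Capacity rules stated in the task description.
CONTAINER_CAPACITY = 3  # a container carries at most three objects
HAND_CAPACITY = 2       # without a container, at most two objects per trip


def min_trips(num_objects, with_container):
    """Rough trip count for one agent (hypothetical, simplified helper).

    Simplification: with a container, only the container's contents count
    toward the per-trip load; the agent's free hand is ignored.
    """
    per_trip = CONTAINER_CAPACITY if with_container else HAND_CAPACITY
    return math.ceil(num_objects / per_trip)


print(min_trips(10, with_container=False))  # 5
print(min_trips(10, with_container=True))   # 4
```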
We selected …
The tasks are named the food task and the stuff task. Containers for the food task can be found in both the kitchen and the living room, while containers for the stuff task can be found in the living room and the office.
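An agent's planner could encode where containers may appear as a simple lookup table and use it to decide which rooms are still worth exploring. The dictionary layout, room-name strings, and rooms_worth_searching below are illustrative assumptions, not identifiers from the TDW-MAT code base.

```python
# Rooms where containers can appear, per task type, as described above.
# Keys and room strings are hypothetical, not TDW-MAT identifiers.
CONTAINER_ROOMS = {
    "food": {"kitchen", "living room"},
    "stuff": {"living room", "office"},
}


def rooms_worth_searching(task, visited):
    """Return, sorted, the container rooms for `task` not yet visited."""
    return sorted(CONTAINER_ROOMS[task] - set(visited))


print(rooms_worth_searching("food", ["kitchen"]))  # ['living room']
```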
The configuration and distribution of containers vary between two distinct settings: the Enough Container Setting and the Rare Container Setting. In the Enough Container Setting, the ratio of containers to objects stands at … In the Rare Container Setting, the container-to-object ratio decreases to …, and containers are strictly localized to a single room.
One example of scenes, target objects, and containers is shown in the following image:
- Transport Rate (TR): The fraction of the target objects successfully transported to the goal position.
- Efficiency Improvement (EI): The efficiency improvement gained by cooperating with base agents.
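The Transport Rate in the metric list reduces to a simple ratio. The sketch below assumes objects are identified by IDs and that only target objects count; transport_rate and the ID representation are hypothetical, not the evaluation code in tdw_mat.

```python
def transport_rate(transported, targets):
    """Fraction of target objects delivered to the goal position.

    `transported` and `targets` are collections of object IDs (a hypothetical
    representation); delivered objects outside `targets` do not count.
    """
    targets = set(targets)
    if not targets:
        raise ValueError("episode has no target objects")
    return len(targets & set(transported)) / len(targets)


print(transport_rate(transported=[1, 2, 7], targets=[1, 2, 3, 4]))  # 0.5
```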
Communicative Watch-And-Help (C-WAH) is an extension of the Watch-And-Help challenge that enables agents to send messages to each other. Sending a message, like any other action, takes one timestep, and messages have an upper limit on their length.
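Because a message costs a timestep just like a physical action, communication competes with acting. The sketch below illustrates that trade-off. The length limit value, its unit (characters here), the action dictionary format, and the names send_message_action and episode_length are all hypothetical; they are not the C-WAH API.

```python
# Hypothetical limit, measured in characters for this sketch; the real value
# and unit used by C-WAH are not stated here.
MAX_MESSAGE_LENGTH = 500


def send_message_action(text, max_len=MAX_MESSAGE_LENGTH):
    """Build a communication action (illustrative dict format, not C-WAH's)."""
    return {"type": "send_message", "content": text[:max_len]}


def episode_length(actions):
    """Every action, including a message, consumes exactly one timestep."""
    return len(actions)


plan = ["walk_to_kitchen", send_message_action("I'll grab the plates."), "grab_plate"]
print(episode_length(plan))  # 3
```

An agent that sends one message per step would spend half its budget talking, which is why deciding when to communicate matters.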
Five types of tasks are available in C-WAH: Prepare afternoon tea, Wash dishes, Prepare a meal, Put groceries, and Set up a dinner table. These tasks cover a range of housework, and each task contains a few subgoals, which are described by predicates. A predicate has the format ON/IN(x, y), meaning "Put x ON/IN y". The detailed descriptions of the tasks are listed in the following table:
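The ON/IN(x, y) predicate format is regular enough to parse with a single pattern. The sketch below turns a predicate string into a structured tuple and back into its "Put x ON/IN y" reading. Predicate, parse_predicate, and describe are hypothetical names for illustration; the C-WAH code may store goals differently.

```python
import re
from typing import NamedTuple


class Predicate(NamedTuple):
    relation: str  # "ON" or "IN"
    obj: str       # x: the object to move
    target: str    # y: where the object must end up


# Matches e.g. "ON(cupcake,coffeetable)" or "IN( plate , dishwasher )".
_PATTERN = re.compile(r"^\s*(ON|IN)\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*$")


def parse_predicate(text):
    """Parse 'ON(x,y)' / 'IN(x,y)' into a Predicate; raise on anything else."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not an ON/IN predicate: {text!r}")
    return Predicate(*match.groups())


def describe(p):
    """Render a predicate in its 'Put x ON/IN y' reading."""
    return f"Put {p.obj} {p.relation} {p.target}"


print(describe(parse_predicate("IN(plate,dishwasher)")))  # Put plate IN dishwasher
```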
| Task Name | Predicate Set |
|---|---|
| Prepare afternoon tea | ON(cupcake,coffeetable), ON(pudding,coffeetable), ON(apple,coffeetable), ON(juice,coffeetable), ON(wine,coffeetable) |
| Wash dishes | IN(plate,dishwasher), IN(fork,dishwasher) |
| Prepare a meal | ON(coffeepot,dinnertable), ON(cupcake,dinnertable), ON(pancake,dinnertable), ON(poundcake,dinnertable), ON(pudding,dinnertable), ON(apple,dinnertable), ON(juice,dinnertable), ON(wine,dinnertable) |
| Put groceries | IN(cupcake,fridge), IN(pancake,fridge), IN(poundcake,fridge), IN(pudding,fridge), IN(apple,fridge), IN(juice,fridge), IN(wine,fridge) |
| Set up a dinner table | ON(plate,dinnertable), ON(fork,dinnertable) |
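Since a task is finished only when all of its subgoals hold, checking progress amounts to a set difference between the subgoals and the predicates currently satisfied. The sketch below encodes two rows of the table as (relation, object, target) tuples. The encoding and the names TASK_PREDICATES, unsatisfied, and task_done are illustrative assumptions, not the C-WAH implementation.

```python
# Two rows of the table, encoded as (relation, object, target) tuples.
# This encoding is illustrative; C-WAH may represent goals differently.
TASK_PREDICATES = {
    "Wash dishes": {("IN", "plate", "dishwasher"), ("IN", "fork", "dishwasher")},
    "Set up a dinner table": {("ON", "plate", "dinnertable"), ("ON", "fork", "dinnertable")},
}


def unsatisfied(subgoals, satisfied):
    """Subgoals that do not yet appear among the satisfied predicates."""
    return set(subgoals) - set(satisfied)


def task_done(subgoals, satisfied):
    """The task is complete only when every subgoal is satisfied."""
    return not unsatisfied(subgoals, satisfied)


goal = TASK_PREDICATES["Wash dishes"]
state = {("IN", "plate", "dishwasher")}
print(task_done(goal, state))  # False
```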
The task goal is to satisfy all the given subgoals within …
- Average Steps (L): The number of steps taken to finish the task.
- Efficiency Improvement (EI): The efficiency improvement gained by cooperating with base agents.
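This README does not spell out how EI is computed. The sketch below assumes one plausible formulation: the reduction in Average Steps when cooperating, divided by the larger of the two averages so the value stays in [-1, 1]. Treat the formula as an assumption and check the paper for the exact definition; average_steps and efficiency_improvement are hypothetical names.

```python
def average_steps(episode_steps):
    """Average Steps (L): mean number of steps taken to finish the tasks."""
    if not episode_steps:
        raise ValueError("no episodes")
    return sum(episode_steps) / len(episode_steps)


def efficiency_improvement(steps_alone, steps_together):
    """Relative step reduction from cooperating (ASSUMED formula).

    Dividing by the larger of the two averages keeps the result in [-1, 1].
    """
    l_alone = average_steps(steps_alone)
    l_together = average_steps(steps_together)
    return (l_alone - l_together) / max(l_alone, l_together)


print(round(efficiency_improvement([100, 120], [80, 80]), 4))  # 0.2727
```

A positive value means the pair finished in fewer steps than the agent alone; a negative value means cooperation slowed it down.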
We observed many interesting agent behaviors in our experiments and identified several cooperative behaviors.
There are more interesting cases and demos on our website!
If you find our work useful, please consider citing:
@article{zhang2023building,
title={Building Cooperative Embodied Agents Modularly with Large Language Models},
author={Zhang, Hongxin and Du, Weihua and Shan, Jiaming and Zhou, Qinhong and Du, Yilun and Tenenbaum, Joshua B and Shu, Tianmin and Gan, Chuang},
journal={arXiv preprint arXiv:2307.02485},
year={2023}
}