GuessWhich is a cooperative image-guessing game between two agents, Q-BOT and A-BOT. It is similar to GuessWhat?!, an object-guessing game between two players.
Lee et al. describe the setup as follows: "GuessWhich is a two player game played by Qbot and Abot. The goal of GuessWhich is to figure out a correct answer out of 9,628 test images by asking a sequence of questions. Abot can see the randomly assigned target image, which is unknown to Qbot. Qbot only observes a caption of the image generated from Neuraltalk2 (Vinyals & Le, 2015). To achieve the goal, Qbot asks a series of questions, to which Abot responds with a sentence." [Quoted from Sang-Woo Lee et al., Large-scale Answer in Questioner's Mind for Visual Dialog Question Generation, ICLR 2019.]
The two agents communicate through natural language dialogue. At the start, both can see a pool of images, from which A-BOT randomly selects one as the secret image; Q-BOT does not know which one it is. Q-BOT asks a sequence of free-form natural language questions, and A-BOT responds with free-form answers. At the end, Q-BOT tries to identify the secret image from the fixed pool. If it picks the right image, the dialogue counts as a success; otherwise, it is a failure.
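The game protocol described above can be sketched as a small loop. This is only an illustrative toy, not the code in this repository: images are replaced by hypothetical sets of attribute words, free-form questions are reduced to single-word yes/no questions, and the caption that Q-BOT normally sees is omitted. The names `POOL`, `ToyABot`, `ToyQBot`, and `play` are made up for this example.

```python
import random

# Toy stand-in for an image pool: each "image" is a set of attribute words.
# (Hypothetical data; the real game uses actual images and learned models.)
POOL = {
    "img_0": {"dog", "grass", "ball"},
    "img_1": {"dog", "beach"},
    "img_2": {"cat", "sofa"},
    "img_3": {"cat", "grass"},
}


class ToyABot:
    """Sees the secret image and answers questions about it."""

    def __init__(self, secret_attrs):
        self.secret_attrs = secret_attrs

    def answer(self, question):
        # A question here is a single attribute word, answered yes/no.
        return "yes" if question in self.secret_attrs else "no"


class ToyQBot:
    """Never sees the secret image; narrows the pool using the dialogue."""

    def __init__(self, pool):
        self.candidates = dict(pool)
        self.asked = set()

    def ask(self):
        # Pick an unasked attribute that still splits the remaining candidates.
        vocab = sorted(set().union(*self.candidates.values()) - self.asked)
        for word in vocab:
            hits = sum(word in attrs for attrs in self.candidates.values())
            if 0 < hits < len(self.candidates):
                self.asked.add(word)
                return word
        return None  # no remaining question would discriminate

    def observe(self, question, reply):
        # Keep only the candidates consistent with A-BOT's reply.
        keep = reply == "yes"
        self.candidates = {
            name: attrs
            for name, attrs in self.candidates.items()
            if (question in attrs) == keep
        }

    def guess(self):
        return sorted(self.candidates)[0]


def play(pool, num_rounds=10, seed=0):
    rng = random.Random(seed)
    secret = rng.choice(sorted(pool))  # secret image, hidden from Q-BOT
    a_bot, q_bot = ToyABot(pool[secret]), ToyQBot(pool)
    dialogue = []
    for _ in range(num_rounds):
        question = q_bot.ask()
        if question is None:
            break
        reply = a_bot.answer(question)
        q_bot.observe(question, reply)
        dialogue.append((question, reply))
    return secret, q_bot.guess(), dialogue


secret, guessed, dialogue = play(POOL)
print("success" if guessed == secret else "failure", dialogue)
```

In this toy every image has a distinct attribute set, so Q-BOT can always narrow the pool down to the secret image. In the real task the agents are learned models and the answers are free-form text, so success is not guaranteed.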
This PyTorch implementation is based on the PyTorch code for Learning Cooperative Visual Dialog Agents using Deep Reinforcement Learning [Das & Kottur et al., ICCV 2017]. GitHub: https://github.com/batra-mlp-lab/visdial-rl
PyTorch version: 1.2.0
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 16.04.4 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
CMake version: Could not collect
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
Nvidia driver version: 410.79
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.3.2
[pip3] numpy==1.16.2
[pip3] numpydoc==0.8.0
[pip3] torch==1.2.0
[pip3] torchfile==0.1.0
[pip3] torchtext==0.7.0
[conda] Could not collect
GuessWhich is a challenging vision-and-language problem. It involves processing a large number of images, together with the mental imagery that a human forms from a natural language dialogue made up of multiple rounds of question-answer pairs.
LaVi Tasks | Conference | Comment |
---|---|---|
GuessWhich | AAAI 2017 | 🐫 |
Multimodal Dialogs(MMD) | AAAI 2018 | - |
CoDraw | ACL 2019 | - |
GuessWhat?! | CVPR 2017 | 😄 |
Multi-agent GuessWhich | AAMAS 2019 | - |
Image-Chat | ACL 2020 | |
EmbodiedQA | CVPR 2018 | |
VideoNavQA | BMVC 2019 | |
GuessNumber | SLT 2018 | |
VisDial | CVPR 2017 | 🐫 |
Image-Grounded Conversations(IGC) | CVPR 2017 | |
VDQG | ICCV 2017 | |
RDG-Image guessing game | LREC 2014 | |
Deal or No Deal | CoRR 2017 | |
Video-Grounded Dialogue Systems (VGDS) | ACL 2019 | |
Vision-Language Navigation (VLN) | CVPR 2018 | |
Image Captioning | ||
Image Retrieval | ||
Visually-grounded Referring Expressions | ||
Multi-modal Verification | ACL 2019 | |
Visual Dialog based Referring Expression | ||
VQA | ||