
Multimodal grounded language dataset

GoLD: A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning

NeurIPS 2021

Gaoussou Youssouf Kebe¹, Padraig Higgins¹, Patrick Jenkins¹, Kasra Darvish¹, Rishabh Sachdeva¹, Ryan Barron¹, John Winder¹,³, Don Engel¹, Edward Raff¹,², Francis Ferraro¹, Cynthia Matuszek¹

¹ University of Maryland, Baltimore County (UMBC)
² Booz Allen Hamilton
³ Johns Hopkins Applied Physics Laboratory

Table of contents

  1. Introduction
  2. Dataset downloading
  3. How to cite

1. Introduction

The Grounded Language Dataset, or GoLD, is a grounded language learning dataset in four modalities: RGB images, depth images, text, and speech. The data contains 207 instances of 47 object classes, drawn from five high-level categories: food, home, medical, office, and tool. Each instance is captured from multiple angles, for a total of 825 images. Text and speech descriptions were collected using Amazon Mechanical Turk (AMT), for a total of 16,500 text descriptions and 16,500 speech descriptions.

The data is intended for use in multimodal grounded language acquisition tasks for domestic robots and for testing algorithmic differences between the domains.

2. Dataset downloading

The dataset consists of a directory of images, a directory of WAV files, and two TSV files with descriptions. Each image is labeled as <object name>_<instance number>_<frame number>, and each WAV file as <object name>_<instance number>_<frame number>_<description number>.
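For example, apple_1_2.png is frame 2 of the first apple instance. Because object names can themselves contain underscores (e.g. allen_wrench), it is safest to split labels from the right. A minimal parsing sketch in Python; the helper names are ours, not part of the dataset:

import os

def parse_item_id(item_id):
    """Split '<object name>_<instance number>_<frame number>' into its parts.

    Object names may contain underscores (e.g. 'allen_wrench'), so split
    from the right to keep the trailing numeric fields intact.
    """
    object_name, instance, frame = item_id.rsplit("_", 2)
    return object_name, int(instance), int(frame)

def parse_wav_name(wav_name):
    """Split '<object name>_<instance>_<frame>_<description number>.wav'."""
    stem = os.path.splitext(wav_name)[0]
    object_name, instance, frame, description = stem.rsplit("_", 3)
    return object_name, int(instance), int(frame), int(description)

# parse_item_id("allen_wrench_1_2")  -> ("allen_wrench", 1, 2)
# parse_wav_name("apple_1_2_3.wav")  -> ("apple", 1, 2, 3)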

The structure of the image files looks like

images
├── RGB
│   ├── allen_wrench
│   │   ├── allen_wrench_1
│   │   │   ├── allen_wrench_1_1.png
│   │   │   ├── allen_wrench_1_2.png
│   │   │   └── ...
│   │   ├── allen_wrench_2
│   │   │   ├── allen_wrench_2_1.png
│   │   │   ├── allen_wrench_2_2.png
│   │   │   └── ...
│   │   └── ...
│   ├── apple
│   │   ├── apple_1
│   │   │   ├── apple_1_1.png
│   │   │   ├── apple_1_2.png
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── RGB_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── RGB_raw
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth_raw
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── pcd
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── pcd_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
└── pcd_visualization
    ├── allen_wrench
    │   └── ...
    ├── apple
    │   └── ...
    └── ...

The images directory contains nine subdirectories: RGB, RGB_cropped, RGB_raw, depth, depth_cropped, depth_raw, pcd, pcd_cropped, and pcd_visualization.
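As a sketch of how this layout might be traversed, the following pairs each RGB frame with the depth frame at the same relative path. The root path and the assumption that the RGB and depth trees share identical file names are ours; check them against your local copy:

from pathlib import Path

ROOT = Path("images")  # adjust to wherever the dataset was extracted

def iter_rgb_depth_pairs(root=ROOT):
    """Yield (rgb_path, depth_path) pairs, assuming the RGB and depth
    trees use the same relative paths (verify against your local copy)."""
    for rgb_path in sorted((root / "RGB").rglob("*.png")):
        depth_path = root / "depth" / rgb_path.relative_to(root / "RGB")
        if depth_path.exists():
            yield rgb_path, depth_path

for rgb, depth in iter_rgb_depth_pairs():
    print(rgb.name, "<->", depth.name)
    break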

speech.tsv contains 6 fields:

  • hit_id: AMT hit id
  • worker_id: anonymized worker id
  • worktime_s: time in seconds to complete the AMT task
  • item_id: label for the object, instance, and frame number
  • wav: name of the related wav file in the speech directory
  • transcription: the Google speech-to-text transcription
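A minimal sketch of reading speech.tsv with pandas and attaching a path to each clip; the column names follow the list above, but the name of the wav directory ("speech") is an assumption:

import pandas as pd
from pathlib import Path

speech = pd.read_csv("speech.tsv", sep="\t")

# Resolve each row's wav file; the "speech" directory name is assumed,
# so point wav_dir at wherever the audio was extracted.
wav_dir = Path("speech")
speech["wav_path"] = speech["wav"].map(lambda name: str(wav_dir / name))

print(speech[["item_id", "wav_path", "transcription"]].head())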

text.tsv contains 5 fields:

  • hit_id: AMT hit id
  • worker_id: anonymized worker id
  • worktime_s: time in seconds to complete the AMT task
  • item_id: label for the object, instance, and frame number
  • text: a single text description for this instance
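In the same spirit, text.tsv can be grouped by item_id to gather every description written for a given frame, and the object class can be recovered from the label using the same right-split convention as above (a sketch):

import pandas as pd

text = pd.read_csv("text.tsv", sep="\t")

# All free-text descriptions written for the same object/instance/frame.
descriptions_per_item = text.groupby("item_id")["text"].apply(list)

# Recover the object class from item_id; object names may contain
# underscores, so split from the right (cf. parse_item_id above).
text["object"] = text["item_id"].str.rsplit("_", n=2).str[0]
print(text["object"].value_counts().head())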

Video files are available upon request.

3. How to cite

@inproceedings{kebe2021a,
  title     = {A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning},
  author    = {Gaoussou Youssouf Kebe and Padraig Higgins and Patrick Jenkins and Kasra Darvish and Rishabh Sachdeva and Ryan Barron and John Winder and Donald Engel and Edward Raff and Francis Ferraro and Cynthia Matuszek},
  booktitle = {Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year      = {2021},
  url       = {https://openreview.net/forum?id=Yx9jT3fkBaD}
}