
Multimodal grounded language dataset

GoLD: A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning

NeurIPS 2021

Gaoussou Youssouf Kebe¹, Padraig Higgins¹, Patrick Jenkins¹, Kasra Darvish¹, Rishabh Sachdeva¹, Ryan Barron¹, John Winder¹,³, Don Engel¹, Edward Raff¹,², Francis Ferraro¹, Cynthia Matuszek¹

¹ University of Maryland, Baltimore County (UMBC)
² Booz Allen Hamilton
³ Johns Hopkins Applied Physics Laboratory

Table of contents

  1. Introduction
  2. Dataset downloading
  3. How to cite

1. Introduction

The Grounded Language Dataset, or GoLD, is a grounded language learning dataset in four modalities: RGB images, depth images, text, and speech. The data contains 207 instances of 47 object classes, drawn from five high-level categories: food, home, medical, office, and tool. Each instance is captured from multiple angles, for a total of 825 images. Text and speech descriptions were collected using Amazon Mechanical Turk (AMT), for a total of 16,500 text descriptions and 16,500 speech descriptions.

The data is intended for use in multimodal grounded language acquisition tasks for domestic robots and for testing algorithmic differences between the domains.

2. Dataset downloading

The dataset consists of a directory of images, a directory of WAV files, and two TSV files with descriptions. Each image is labeled as <object name>_<instance number>_<frame number>, and each WAV file as <object name>_<instance number>_<frame number>_<description number>.
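For example, apple_1_2.png is frame 2 of the first apple instance. Because object names can themselves contain underscores (e.g. allen_wrench), it is safest to split labels from the right. A minimal parsing sketch in Python; the helper names are ours, not part of the dataset:

import os

def parse_item_id(item_id):
    """Split '<object name>_<instance number>_<frame number>' into its parts.

    Object names may contain underscores (e.g. 'allen_wrench'), so split
    from the right to keep the trailing numeric fields intact.
    """
    object_name, instance, frame = item_id.rsplit("_", 2)
    return object_name, int(instance), int(frame)

def parse_wav_name(wav_name):
    """Split '<object name>_<instance>_<frame>_<description number>.wav'."""
    stem = os.path.splitext(wav_name)[0]
    object_name, instance, frame, description = stem.rsplit("_", 3)
    return object_name, int(instance), int(frame), int(description)

# parse_item_id("allen_wrench_1_2")  -> ("allen_wrench", 1, 2)
# parse_wav_name("apple_1_2_3.wav")  -> ("apple", 1, 2, 3)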

The structure of the image files looks like

images
├── RGB
│   ├── allen_wrench
│   │   ├── allen_wrench_1
│   │   │   ├── allen_wrench_1_1.png
│   │   │   ├── allen_wrench_1_2.png
│   │   │   └── ...
│   │   ├── allen_wrench_2
│   │   │   ├── allen_wrench_2_1.png
│   │   │   ├── allen_wrench_2_2.png
│   │   │   └── ...
│   │   └── ...
│   ├── apple
│   │   ├── apple_1
│   │   │   ├── apple_1_1.png
│   │   │   ├── apple_1_2.png
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── RGB_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── RGB_raw
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── depth_raw
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── pcd
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
├── pcd_cropped
│   ├── allen_wrench
│   │   └── ...
│   ├── apple
│   │   └── ...
│   └── ...
└── pcd_visualization
    ├── allen_wrench
    │   └── ...
    ├── apple
    │   └── ...
    └── ...

The images directory contains nine subdirectories: RGB, RGB_cropped, RGB_raw, depth, depth_cropped, depth_raw, pcd, pcd_cropped, and pcd_visualization.
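As a sketch of how this layout might be traversed, the following pairs each RGB frame with the depth frame at the same relative path. The root path and the assumption that the RGB and depth trees share identical file names are ours; check them against your local copy:

from pathlib import Path

ROOT = Path("images")  # adjust to wherever the dataset was extracted

def iter_rgb_depth_pairs(root=ROOT):
    """Yield (rgb_path, depth_path) pairs, assuming the RGB and depth
    trees use the same relative paths (verify against your local copy)."""
    for rgb_path in sorted((root / "RGB").rglob("*.png")):
        depth_path = root / "depth" / rgb_path.relative_to(root / "RGB")
        if depth_path.exists():
            yield rgb_path, depth_path

for rgb, depth in iter_rgb_depth_pairs():
    print(rgb.name, "<->", depth.name)
    break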

speech.tsv contains 6 fields:

  • hit_id: AMT hit id
  • worker_id: anonymized worker id
  • worktime_s: time in seconds to complete the AMT task
  • item_id: label for the object, instance, and frame number
  • wav: name of the related wav file in the speech directory
  • transcription: the Google speech-to-text transcription
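A minimal sketch of reading speech.tsv with pandas and attaching a path to each clip; the column names follow the list above, but the name of the wav directory ("speech") is an assumption:

import pandas as pd
from pathlib import Path

speech = pd.read_csv("speech.tsv", sep="\t")

# Resolve each row's wav file; the "speech" directory name is assumed,
# so point wav_dir at wherever the audio was extracted.
wav_dir = Path("speech")
speech["wav_path"] = speech["wav"].map(lambda name: str(wav_dir / name))

print(speech[["item_id", "wav_path", "transcription"]].head())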

text.tsv contains 5 fields:

  • hit_id: AMT hit id
  • worker_id: anonymized worker id
  • worktime_s: time in seconds to complete the AMT task
  • item_id: label for the object, instance, and frame number
  • text: a single text description for this instance
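In the same spirit, text.tsv can be grouped by item_id to gather every description written for a given frame, and the object class can be recovered from the label using the same right-split convention as above (a sketch):

import pandas as pd

text = pd.read_csv("text.tsv", sep="\t")

# All free-text descriptions written for the same object/instance/frame.
descriptions_per_item = text.groupby("item_id")["text"].apply(list)

# Recover the object class from item_id; object names may contain
# underscores, so split from the right (cf. parse_item_id above).
text["object"] = text["item_id"].str.rsplit("_", n=2).str[0]
print(text["object"].value_counts().head())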

Video files are available upon request.

3. How to cite

@inproceedings{kebe2021a,
  title     = {A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning},
  author    = {Gaoussou Youssouf Kebe and Padraig Higgins and Patrick Jenkins and Kasra Darvish and Rishabh Sachdeva and Ryan Barron and John Winder and Donald Engel and Edward Raff and Francis Ferraro and Cynthia Matuszek},
  booktitle = {Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year      = {2021},
  url       = {https://openreview.net/forum?id=Yx9jT3fkBaD}
}