
Grounded Sequence-to-Sequence Transduction Team at JSALT 2018

We are a team of computational linguistics researchers from academia and industry. For six weeks, we will be working on algorithms for language grounding using multiple modalities at the Frederick Jelinek Memorial Summer Workshop at Johns Hopkins University.

What is the project about?

Video understanding is one of the hardest challenges in Artificial Intelligence research. If machines can look at videos and “understand” the events being shown, then they could learn by themselves, perhaps even without supervision, simply by “watching” broadcast TV, Facebook, YouTube, or similar sites.

As a first step, we will combine what is written, spoken, and seen about objects and actions in how-to videos: if we see a person slicing round, red objects and putting them on a brown surface, it is more likely that he or she is explaining how to make a sandwich than how to change a car tire. And we might learn that the red objects are called “tomatoes”. Our team will develop methods that exploit multimodality to process and analyze videos for three main tasks: speech captioning, video-to-text summarization, and translation into a different language. These tasks are diverse but not unrelated, so we propose to model them in a multi-task sequence-to-sequence learning framework where these (and other, auxiliary) tasks can benefit from shared representations.
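To make the idea of shared representations a bit more concrete, here is a minimal PyTorch sketch of a multi-task sequence-to-sequence setup: one encoder over multimodal features feeds separate task decoders. The module names, feature dimensions, and the simple mean-pooled context are our own placeholders for illustration, not the team's final architecture.

```python
# Illustrative multi-task seq2seq sketch (placeholder dimensions and modules).
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Encodes a sequence of multimodal features (e.g. video + audio frames)."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, feats):                       # feats: (batch, time, feat_dim)
        outputs, _ = self.rnn(feats)                # (batch, time, 2 * hidden_dim)
        return outputs

class TaskDecoder(nn.Module):
    """One decoder per task (captioning, summarization, translation)."""
    def __init__(self, vocab_size, hidden_dim=512, enc_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim + enc_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc_outputs, prev_tokens):
        # Mean-pool encoder states as a very simple stand-in for attention.
        context = enc_outputs.mean(dim=1, keepdim=True)          # (batch, 1, enc_dim)
        emb = self.embed(prev_tokens)                            # (batch, tgt_len, hidden)
        context = context.expand(-1, emb.size(1), -1)
        dec_out, _ = self.rnn(torch.cat([emb, context], dim=-1))
        return self.out(dec_out)                                 # (batch, tgt_len, vocab)

class MultiTaskS2S(nn.Module):
    """Shared encoder, task-specific decoders selected by a task name."""
    def __init__(self, vocab_sizes):
        super().__init__()
        self.encoder = SharedEncoder()
        self.decoders = nn.ModuleDict(
            {task: TaskDecoder(v) for task, v in vocab_sizes.items()}
        )

    def forward(self, feats, prev_tokens, task):
        return self.decoders[task](self.encoder(feats), prev_tokens)

model = MultiTaskS2S({"caption": 10000, "summary": 10000, "translation": 12000})
feats = torch.randn(2, 50, 2048)              # 2 clips, 50 feature frames each
prev = torch.randint(0, 10000, (2, 7))        # teacher-forced target prefixes
logits = model(feats, prev, task="caption")   # (2, 7, 10000)
```

Because the encoder parameters are updated by the losses of all tasks, each task can in principle benefit from supervision available for the others.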

The tasks we propose generate natural language, which poses a number of well-known challenges, such as lexical, syntactic, and semantic ambiguity and reference resolution. Grounding language in other modalities, e.g. the visual and audio information we propose to use here, can help overcome these challenges. Information extracted from speech, audio, and video will serve as rich context models for the various tasks we plan to address.

Dataset - HowTo Videos Corpus

The dataset we will use in this project is a set of instructional videos called the HowTo corpus, containing about 2,000 hours of speech. We are collecting Portuguese (and possibly Turkish) translations for these videos via crowdsourcing, and we will also be collecting a dedicated summarization dataset from these HowTo videos. A dataset website and more information are coming soon -- in the meantime...

Here is the topic distribution visualization for this dataset. We find that 25 topics, including yoga, cooking, sports, guitar, sewing, and many more, are most representative of the dataset. Set the relevance metric on the right to ~0.2 and click on a particular topic cluster to see the top words in each topic. Toggle the options in the interactive visualization and have fun!
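For the curious, this kind of interactive topic view can be produced with an LDA model rendered through pyLDAvis, whose lambda slider is the relevance metric mentioned above. The sketch below is only a rough illustration of such a pipeline (the file path, preprocessing, and training settings are placeholders, and the pyLDAvis gensim helper module name differs between library versions), not the exact recipe used for the HowTo corpus.

```python
# Illustrative topic-model visualization sketch (placeholder paths and settings).
from gensim import corpora, models
from gensim.utils import simple_preprocess
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # "pyLDAvis.gensim" in older versions

# Assume one video transcript per line in a plain-text file (hypothetical path).
with open("howto_transcripts.txt", encoding="utf-8") as f:
    docs = [simple_preprocess(line) for line in f]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop rare/ubiquitous words
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# 25 topics, matching the number we found most representative of the dataset.
lda = models.LdaModel(bow_corpus, num_topics=25, id2word=dictionary, passes=10)

# The resulting HTML page includes the relevance (lambda) slider described above.
vis = gensimvis.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis, "howto_topics.html")
```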

Team Leader

Senior Researchers

Postdoctoral Researchers

Graduate Students

Undergraduate Students

  • Jasmine (Sun Jae) Lee (University of Pennsylvania)
  • Karl Mulligan (Rutgers University)
  • Alissa Ostapenko (Worcester Polytechnic Institute)

Related Publications

Here is part of the team:

(Photo of part of the team)

Picture with all of us coming soon :)

Cheers!