Visual-Vestibular Generative Query Network

Motivation

Generative Query Networks (GQNs) learn representations of their environments by predicting what those environments would look like from novel viewpoints. The architecture consists of a Representation Network that encodes a collection of viewpoint/image tuples, and a Generator Network that produces an image given the output of the Representation Network and a novel query viewpoint. To complete the task, the learned environment representations must be viewpoint-invariant and encode information about 3D structure/layout.

Taken as a model of human scene representation, however, the GQN and its task have two important shortcomings:

The task is not biologically plausible. Supervision is required in the form of ground-truth locations/viewpoints in an experimenter-defined reference frame.
The representation does not contain information about the agent and its position or orientation in the environment.

This work aims to construct a more biologically motivated variant of GQNs that uses a similar predictive-coding task, but with vestibular movement inputs in place of ground-truth positions. The hope is that the learned representations will contain:

Viewpoint-invariant descriptions of the environment and its layout.
A learned environment-dependent reference frame (i.e. emergent grid cells).
A description of the agent's current position and orientation in the reference frame (i.e. emergent place cells and head direction cells).

Implementation

The model consists of:

A vision module (CNN), which takes as input the current image at time t and outputs a vector representing that image.
An environment representation module (LSTM), which takes as input a sequence of visual representation and vestibular movement tuples, and whose internal state represents both the environment and the agent's location in it.
A projection module (LSTM), which takes as input the current state of the environment representation module and a sequence of vestibular movement tuples, and whose internal state represents both the environment and the agent's new location in it.
A generator module (CNN), which takes as input the projection module's current state and imagines what the environment would look like from this new viewpoint.
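
The data flow between the four modules above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes PyTorch, and the class names (`VisionModule`, `Generator`, `VVGQNSketch`), layer sizes, the 2-dimensional movement vector, and the 32x32 output resolution are all hypothetical. It also assumes that the projection LSTM receives the environment state as its initial hidden/cell state, which is one plausible reading of the description.

```python
import torch
import torch.nn as nn


class VisionModule(nn.Module):
    """CNN mapping an image to a feature vector (layer sizes are assumptions)."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, img):              # img: (N, 3, H, W)
        return self.net(img)             # (N, feat_dim)


class Generator(nn.Module):
    """Transposed CNN decoding a state vector into a 32x32 RGB image (assumed size)."""

    def __init__(self, state_dim=128):
        super().__init__()
        self.fc = nn.Linear(state_dim, 32 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, state):            # state: (N, state_dim)
        x = self.fc(state).view(-1, 32, 8, 8)
        return self.net(x)               # (N, 3, 32, 32), values in [0, 1]


class VVGQNSketch(nn.Module):
    """Hypothetical wiring of the vision, environment, projection and generator modules."""

    def __init__(self, feat_dim=64, move_dim=2, state_dim=128):
        super().__init__()
        self.vision = VisionModule(feat_dim)
        # Environment LSTM consumes (visual feature, movement) pairs.
        self.env_rnn = nn.LSTM(feat_dim + move_dim, state_dim, batch_first=True)
        # Projection LSTM consumes movements only.
        self.proj_rnn = nn.LSTM(move_dim, state_dim, batch_first=True)
        self.generator = Generator(state_dim)

    def forward(self, images, moves, query_moves):
        # images: (B, T, 3, H, W); moves: (B, T, move_dim); query_moves: (B, Q, move_dim)
        B, T = images.shape[:2]
        feats = self.vision(images.flatten(0, 1)).view(B, T, -1)
        _, env_state = self.env_rnn(torch.cat([feats, moves], dim=-1))
        # Assumption: the projection LSTM starts from the environment state
        # and integrates the query movements without seeing any new images.
        _, (h, _) = self.proj_rnn(query_moves, env_state)
        return self.generator(h[-1])     # predicted image at the new viewpoint


model = VVGQNSketch()
predicted = model(torch.rand(2, 5, 3, 32, 32), torch.rand(2, 5, 2), torch.rand(2, 3, 2))
print(predicted.shape)
```

Under these assumptions, training would compare `predicted` against the image actually observed after the query movements, so that no ground-truth position ever enters the model.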

The environment currently consists of a very simple square room to ensure that learning is possible with the current models, losses, and task description.
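
To make the distinction between vestibular inputs and ground-truth positions concrete, here is a small sketch of how a trajectory in a square room might be sampled. It is an illustration only: the function name `sample_trajectory`, the (turn, forward) movement encoding, the step length, and the wall-handling rule are assumptions, not details taken from the repository.

```python
import math
import random


def sample_trajectory(steps, room_size=1.0, step_len=0.1,
                      max_turn=math.pi / 4, seed=0):
    """Random walk in a square room [0, room_size] x [0, room_size].

    Returns (moves, poses). `moves` holds egocentric (turn, forward) tuples,
    the only movement signal the model would receive. `poses` holds the
    ground-truth (x, y, heading) values, kept here for evaluation only.
    """
    rng = random.Random(seed)
    x = y = room_size / 2
    heading = rng.uniform(-math.pi, math.pi)
    moves, poses = [], [(x, y, heading)]
    for _ in range(steps):
        turn = rng.uniform(-max_turn, max_turn)
        # Wrap the heading into [-pi, pi).
        heading = (heading + turn + math.pi) % (2 * math.pi) - math.pi
        nx = x + step_len * math.cos(heading)
        ny = y + step_len * math.sin(heading)
        # Assumed rule: the agent turns but stays in place if the step would cross a wall.
        if 0.0 <= nx <= room_size and 0.0 <= ny <= room_size:
            forward = step_len
            x, y = nx, ny
        else:
            forward = 0.0
        moves.append((turn, forward))
        poses.append((x, y, heading))
    return moves, poses


moves, poses = sample_trajectory(steps=5)
print(len(moves), len(poses))
```

The movement tuples are relative to the agent's own heading, so recovering a position from them requires integrating over the whole sequence. That integration is the kind of computation the environment representation module would have to learn.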

EricElmoznino/VVGQN
