Continual Transformers: Redundancy-Free Attention for Online Inference

Official implementation of Continual Transformers, including ready-to-use modules for Continual Inference.

Fig. 1: Continual Retroactive Dot-Product Attention. The query (Q), key (K), and value (V) matrices are aggregated over time by caching the step vectors q_n, k_n, and v_n in a FIFO queue. During each step, only the entries of A associated with q_n, k_n, and the oldest K step, k_o, are computed. The diagonal entries of the row-normalisation matrix D, as well as the product AV, can be updated retroactively by subtracting features corresponding to k_o and adding features related to k_n to the cached outputs of the previous step, D_{mem} and AV_{mem}, respectively.
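
The update described in Fig. 1 can be sketched in plain Python as follows. This is a minimal illustration, not the code shipped in this repository: the class name `RetroactiveAttention`, the helper `dot`, the 1/sqrt(dim) scaling, and the handling of the warm-up phase (before the window is full) are all assumptions made for the example, and numerical stabilisation of the exponentials is omitted for brevity.

```python
import math
from collections import deque


def dot(a, b):
    """Dot product of two equal-length vectors (hypothetical helper)."""
    return sum(x * y for x, y in zip(a, b))


class RetroactiveAttention:
    """Sliding-window attention that updates every output in the window per step."""

    def __init__(self, window: int, dim: int):
        self.window = window
        self.scale = 1.0 / math.sqrt(dim)  # assumed scaled dot-product attention
        # FIFO caches of the step vectors.
        self.q, self.k, self.v = deque(), deque(), deque()
        self.d_mem = deque()   # cached row sums of A (diagonal of D)
        self.av_mem = deque()  # cached rows of the unnormalised product AV

    def _a(self, q, k):
        # One entry of A: exponentiated scaled dot product.
        return math.exp(self.scale * dot(q, k))

    def step(self, q_n, k_n, v_n):
        if len(self.k) == self.window:
            # Evict the oldest step; the row of the oldest query is dropped.
            self.q.popleft()
            k_o, v_o = self.k.popleft(), self.v.popleft()
            self.d_mem.popleft()
            self.av_mem.popleft()
            # Subtract the contribution of k_o and v_o from the remaining rows.
            for i, q in enumerate(self.q):
                a = self._a(q, k_o)
                self.d_mem[i] -= a
                self.av_mem[i] = [x - a * y for x, y in zip(self.av_mem[i], v_o)]
        # Add the contribution of k_n and v_n to the cached rows.
        for i, q in enumerate(self.q):
            a = self._a(q, k_n)
            self.d_mem[i] += a
            self.av_mem[i] = [x + a * y for x, y in zip(self.av_mem[i], v_n)]
        self.q.append(q_n)
        self.k.append(k_n)
        self.v.append(v_n)
        # Compute the new row for q_n against every key in the window.
        row = [self._a(q_n, k) for k in self.k]
        self.d_mem.append(sum(row))
        self.av_mem.append(
            [sum(a * v[j] for a, v in zip(row, self.v)) for j in range(len(v_n))]
        )
        # Row-normalise the cached product: D^{-1} A V.
        return [[x / d for x in av] for d, av in zip(self.d_mem, self.av_mem)]


# Usage: feed one token per step; each call returns one output per cached query.
attention = RetroactiveAttention(window=3, dim=2)
for t in range(5):
    x = [0.1 * t, 1.0]
    outputs = attention.step(x, x, x)
```

Per step, the sketch evaluates one column of A for k_o, one column for k_n, and one row for q_n, so the cost grows linearly with the window length rather than quadratically.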

Fig. 2: Continual Single-Output Dot-Product Attention. The key (K) and value (V) matrices are aggregated over time by caching the step vectors k_n and v_n in a FIFO queue. During each step, only the attention output associated with q is computed.
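
The single-output variant in Fig. 2 is simpler and can be sketched in a few lines of plain Python. Again, this is an illustrative sketch under stated assumptions rather than the repository's implementation: the class name `SingleOutputAttention`, the 1/sqrt(dim) scaling, and the max-subtraction used to stabilise the softmax are choices made for the example.

```python
import math
from collections import deque


class SingleOutputAttention:
    """Sliding-window attention that returns only the output for the newest query."""

    def __init__(self, window: int, dim: int):
        self.scale = 1.0 / math.sqrt(dim)  # assumed scaled dot-product attention
        # FIFO caches: appending to a full deque drops the oldest entry.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        # Scaled dot products between the query and every cached key.
        scores = [
            self.scale * sum(a * b for a, b in zip(q, key)) for key in self.keys
        ]
        # Softmax weights, shifted by the maximum score for numerical stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the cached values.
        return [
            sum(w * val[j] for w, val in zip(weights, self.values)) / total
            for j in range(len(v))
        ]


# Usage: one output vector per step, attending over the last `window` steps.
single = SingleOutputAttention(window=3, dim=2)
for t in range(5):
    x = [float(t), 1.0]
    output = single.step(x, x, x)
```

Because the query is not cached, each step only needs the newest q together with the stored keys and values.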

Setup

Continual Transformers and its modules can be installed in your project using:

pip install git+https://github.com/LukasHedegaard/continual-transformers.git

Experiments and results

The experiment code-base is split into separate repositories for Online Action Detection and Online Audio Classification. Below, we present a summary of results from the paper.

Citation

@article{hedegaard2022cotrans,
  title={Continual Transformers: Redundancy-Free Attention for Online Inference},
  author={Lukas Hedegaard and Alexandros Iosifidis},
  journal={International Conference on Learning Representations (ICLR)},
  year={2023}
}

Contributing

See CONTRIBUTING.md