DenseAttention: No-Compromise Exact All N × N Interactions Algorithm with O(N) Space and Time Complexity
This repository hosts the official implementation and experiment code for the paper "DenseAttention: No-Compromise Exact All N × N Interactions Algorithm with O(N) Space and Time Complexity".
This research pursues several goals:
- To create a hardware-efficient yet hardware-agnostic architecture with as high an arithmetic intensity ratio as possible. <...>
- To create an algorithm which efficiently processes long sequences, preferably with $O(N)$ time and space complexity.
- To make the resulting architecture as simple as possible while closely resembling the original Transformer architecture, so that it can serve as a drop-in replacement for the latter and be easily adopted by both the research and practitioner communities.
The main contributions
From the abstract:
"We propose a novel DenseAttention Network architecture, a straightforward simplification of the standard Transformer block that addresses these issues and serves as a drop-in replacement for language modeling tasks. We eliminate memory-bound components in DenseAttention, including Softmax, masking, one skip connection, and both LayerNorms, as well as key, value, and output projection matrices, as they become redundant. Despite these removals, it maintains exact
The paper is undergoing peer review at a major ML conference; the code and select parts of the manuscript are subject to change.
The most recent draft of the paper (October 2024) is available at: DenseAttention.pdf
The code for the DenseAttention Network, its constituents, and the models built upon it can be found in the `src/` directory.
To reproduce the experiments, it's recommended to create a Docker container with preinstalled CUDA or a dedicated conda environment. To get started, run:

```bash
git clone https://github.com/andrewargatkiny/dense-attention.git
cd dense-attention
pip install -r requirements.txt
export BASE_DATA_DIR="path/to/desired/data/directory"
```
To prepare the data, use the scripts in the `data_preparation/` directory. Note that data preparation can take a long time.
To train any of the models, launch the corresponding `ds_train_*.sh` script from the `configs/` directory.
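For example, a run could be started with something like `bash configs/ds_train_bert.sh` (the exact script name here is illustrative; list the `configs/` directory for the available ones). Judging by the ClearML note below, each such script wraps a `deepspeed_train.py` invocation with the appropriate parameters.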
To evaluate BERT models, use either `eval_all_seqs.sh` for bulk evaluations or the individual scripts in the `bert_evals/` directory.
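For instance, a bulk evaluation across sequence lengths could look like `bash eval_all_seqs.sh` (the invocation is illustrative; check the script itself for any required arguments).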
It's recommended to use the ClearML open-source ML experiment tracking system for the training runs. Instructions on how to install it on your system and how to use it are available in the ClearML documentation.
ClearML is enabled by default. If you'd prefer not to use it, disable it by passing the `--no_clearml` flag among the `deepspeed_train.py` parameters of the training script.
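For example, appending `--no_clearml` to the `deepspeed_train.py` arguments inside your chosen `ds_train_*.sh` script turns the integration off for that run.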