This code illustrates the Dispatcher algorithm as presented in the paper.

[Figure: shift_and_sum, an illustration of the Dispatcher's shift-and-sum operation]

Installation

virtualenv --python=/usr/bin/python3 .env
source .env/bin/activate
pip install -r requirements.txt

Training the models

The models can be trained from scratch with the following scripts:

train_dispatcher_after_openwebtext_wikitext2.py
train_dispatcher_after_openwebtext_wikitext103.py
train_msa_wikitext2.py
train_msa_wikitext103.py
train_plain_dispatcher_on_wikitext2.py
train_plain_dispatcher_on_wikitext103.py 

Evaluation

The perplexity of the pre-trained models can be evaluated with the following scripts:

test_dispatcher_after_openwebtext_on_wikitext2.py
test_dispatcher_after_openwebtext_on_wikitext103.py
test_plain_dispatcher_on_wikitext2.py
test_plain_dispatcher_on_wikitext103.py

The plain Dispatcher has about 30% more parameters on Wikitext103 because of a slightly different tokenization technique: the token vocabulary on Wikitext2 is kept smaller to achieve better performance.

Code

The Dispatcher is identical to the Transformer architecture with one crucial difference: the self-attention layer is substituted with the Dispatcher layer.
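
As a schematic illustration only (not the code used in the paper), the sketch below shows what such a block could look like: a Transformer-style layer that keeps the feed-forward sub-layer, residual connections and layer normalization, but uses the DispatcherLayer defined further down in place of self-attention. The class name DispatcherBlock and the ff_dim parameter are hypothetical and do not appear in the repository; the actual model lives in dispatcher_model.py.

import torch.nn as nn


class DispatcherBlock(nn.Module):
    """Hypothetical Transformer-style block with self-attention replaced by DispatcherLayer."""

    def __init__(self, embed_dim, num_heads, bptt, ff_dim, dropout=0.1):
        super().__init__()
        # DispatcherLayer (defined below) takes the place of nn.MultiheadAttention.
        self.dispatcher = DispatcherLayer(embed_dim, num_heads, bptt, dropout)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask):
        # Same residual + normalization pattern as a standard Transformer encoder layer.
        x = self.norm1(x + self.drop(self.dispatcher(x, mask)))
        x = self.norm2(x + self.drop(self.feed_forward(x)))
        return x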

This algorithm, explained in the paper, is contained in the file dispatcher_model.py. The following code is this work's main contribution:

import math
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class DispatcherLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, bptt, dropout=0.):
        super(DispatcherLayer, self).__init__()

        # Number of shift-and-sum levels: log2 of the sequence length (bptt),
        # giving shifts 1, 2, 4, ..., bptt / 2.
        self._levels = int(math.log(bptt, 2))
        self._shifts = [pow(2, i) for i in range(self._levels)]

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.dropout = dropout

        self.linear_in = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.internal_attention = nn.Linear(self.head_dim, self._levels, bias=False)
        self.linear_out = nn.Linear(self.head_dim, self.head_dim, bias=False)

    def forward(self, value, mask):
        # (length, batch, embed_dim) -> (batch, length, embed_dim)
        #                            -> (batch * num_heads, length, head_dim)
        inp = value.transpose(1, 0)
        batch_length = inp.shape[0]
        length = inp.shape[1]
        inp = inp.reshape(batch_length * self.num_heads, length, self.head_dim)

        V = self.linear_in(inp)

        # One dispatching coefficient per position and per level, modulated by the mask.
        coefficient_tensor = F.sigmoid(self.internal_attention(inp)) * mask.detach()
        coefficient = torch.chunk(coefficient_tensor, chunks=self._levels, dim=2)

        # Shift and sum: at each level, add a shifted copy of V weighted by its coefficient.
        for c, shift in zip(coefficient, self._shifts):
            if shift > length:
                break
            # Dropout at level granularity: occasionally skip a whole level during training.
            if self.training and random.uniform(0, 1) < self.dropout:
                continue
            V += c * torch.roll(V, shifts=shift, dims=1)

        out = self.linear_out(V)
        # Back to (batch, length, embed_dim), then to the original (length, batch, embed_dim) layout.
        out = out.reshape(batch_length, length, self.embed_dim)
        return out.transpose(1, 0)

The main loop is in the forward() method, where the shift-and-sum steps described in the paper are applied.
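
As a quick sanity check of the layer above, the snippet below runs a forward pass on random data. The sizes and the all-ones mask are assumptions made purely for illustration; they are not the settings used for the reported results.

import torch

# Hypothetical sizes, chosen only for this example.
embed_dim, num_heads, bptt = 64, 4, 32
seq_len, batch_size = 32, 2

layer = DispatcherLayer(embed_dim, num_heads, bptt)

# Input follows the (length, batch, embed_dim) layout expected by forward().
value = torch.randn(seq_len, batch_size, embed_dim)
# All-ones mask, broadcast against the (batch * heads, length, levels) coefficients;
# used here only to exercise the forward pass.
mask = torch.ones(seq_len, layer._levels)

out = layer(value, mask)
print(out.shape)  # torch.Size([32, 2, 64])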

A second file, msa_model.py, contains the "standard" Masked Self-Attention model. The two models are nearly identical, differing only in the Dispatcher layers.

Multihead Dispatcher

The code above is the version used in the paper. After submission, a new model was found that performs better with multiple heads. It can be found in this repo at dispatcher/dispatcher_model_multihead.py. Please use this newer model if you are aiming for competitive multi-head results.

Run the code

A notebook, dispatcher.ipynb, is included to run the code and generate text with the various models.