Multipack Efficiency
KeremTurgutlu opened this issue · 9 comments
Thanks for putting this together!
I am looking into the multipack sampler to get a better understanding of what it is doing. My initial understanding is that it packs sequences so that each bin satisfies total length in batch < bs x seqlen, and the collator later pads to the longest sequence in the batch. I created a toy example to check the unpadded token ratio in each batch, and it turned out to be lower than I expected. I also printed the efficiency() computed in the batch sampler, and it gives a different number.
```python
from dataclasses import dataclass
from typing import Dict, Sequence

import datasets
import numpy as np
import torch
import transformers
from torch.utils.data import DataLoader

# MultipackDistributedBatchSampler comes from
# https://github.com/imoneoi/multipack_sampler (copy the class or adjust the import).


class DummyTokenizer:
    pad_token_id = 0


@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple(
            [instance[key] for instance in instances] for key in ("input_ids", "labels")
        )
        # BEGIN: added lines to convert lists to torch tensors
        input_ids = [torch.tensor(x) for x in input_ids]
        labels = [torch.tensor(x) for x in labels]
        # END
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=-100
        )
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )


# Toy dataset: 100 sequences with lengths 1..100 in random order.
ds = [(torch.ones(x) * x).long() for x in np.random.permutation(np.arange(1, 101))]
ds = [{"input_ids": x, "labels": x} for x in ds]
ds = datasets.Dataset.from_list(ds)
lengths = np.array([len(x["input_ids"]) for x in ds])

train_sampler = MultipackDistributedBatchSampler(
    batch_max_length=4 * 128,
    lengths=lengths,
    num_replicas=1,
    rank=0,
    seed=42,
)
tokenizer = DummyTokenizer()
collator = DataCollatorForSupervisedDataset(tokenizer)
train_loader = DataLoader(
    ds,
    pin_memory=False,
    collate_fn=collator,
    batch_sampler=train_sampler,
)

for b in train_loader:
    # Unpadded token ratio of the collated (padded-to-longest) batch.
    print((b["input_ids"] != tokenizer.pad_token_id).view(-1).float().mean())
    # Packing efficiency reported by the sampler.
    print(train_loader.batch_sampler.efficiency())
```
```
tensor(0.5262)
0.8966619318181818
tensor(0.6364)
0.8966619318181818
tensor(0.4837)
0.8966619318181818
tensor(0.4582)
0.8966619318181818
tensor(0.6306)
0.8966619318181818
tensor(0.7002)
0.8966619318181818
tensor(0.5488)
0.8966619318181818
tensor(0.5200)
0.8966619318181818
tensor(0.4594)
0.8966619318181818
tensor(0.8535)
0.8966619318181818
tensor(0.5618)
0.8966619318181818
```
Maybe I am missing something here. Thanks!
As far as I understand, this packing brings as many data samples into the batch as it can while keeping the total length < bs * max_len. The data collator should pad to the longest sequence in the batch. It could be the case that I have a bug here somewhere. The multipack sampler comes from: https://github.com/imoneoi/multipack_sampler/blob/master/README.md#usage
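For reference, here is a rough sketch of why the two numbers can differ. I am assuming efficiency() is total packed tokens divided by total bin capacity, which is consistent with the 0.8966... ≈ 5050 / (11 * 512) printed above, while the per-batch ratio is measured against the padded-to-longest rectangular tensor the collator produces. The lengths below are made up for illustration:

```python
import numpy as np

# Hypothetical lengths packed into a single bin of capacity 4 * 128 = 512 tokens.
batch_max_length = 4 * 128
seq_lens = np.array([100, 90, 80, 70, 60, 50, 40, 12])

# Sampler-style efficiency (assumed definition): packed tokens / bin capacity.
packing_efficiency = seq_lens.sum() / batch_max_length                  # ~0.98

# Collator-side ratio: pad_sequence pads every row to the longest sequence,
# so the denominator is num_sequences * longest_length, not the bin capacity.
padded_token_ratio = seq_lens.sum() / (len(seq_lens) * seq_lens.max())  # ~0.63

print(packing_efficiency, padded_token_ratio)
```

So a bin can be nearly full by the sampler's measure while the collated rectangular batch still contains a lot of padding whenever the packed lengths are uneven.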
Thanks for the prompt response; that was my understanding of the multipack sampler too. I will also check how it is leveraged in https://github.com/OpenAccess-AI-Collective/axolotl.
Some quick tests for 1 epoch using 1k samples: multipack results in faster training compared to the torch sampler.
multipack sampler:
batch_size = 2
completion: 3 minutes 57 seconds
---
batch_size = 3
completion: OOM
torch sampler:
batch_size = 6
completion: 6 minutes 10 seconds
---
batch_size = 7
completion: OOM
I guess the torch sampler here is just a random sampler. I can take another look at it then. My main concern was the high padded token ratio; maybe I can use the data you have in the repo to compute it again. The ORCA paper actually mentions a different packing algorithm: https://github.com/graphcore/tutorials/tree/sdk-release-2.1/blogs_code/packedBERT. I found it to be slow, which is why I wanted to check this method.
https://github.com/graphcore/tutorials/blob/e9dbe4825f034a47871c4db0deb86d727cbd69b9/blogs_code/packedBERT/nnlshp.py#L51 is the main solver; if it can be sped up, it could be used as well.
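For comparison, a simple first-fit-decreasing heuristic (a sketch written here for illustration, not code from either repo) gives approximate packings in roughly O(n²) as written, which is typically much faster than solving the histogram-packing problem exactly, at the cost of slightly less optimal fills:

```python
def first_fit_decreasing(lengths, bin_capacity):
    """Greedy approximate bin packing: place each sequence (longest first)
    into the first bin with enough remaining capacity, else open a new bin."""
    remaining = []   # remaining capacity per bin
    bins = []        # bin index -> list of sequence indices
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        for b, cap in enumerate(remaining):
            if lengths[i] <= cap:
                remaining[b] -= lengths[i]
                bins[b].append(i)
                break
        else:
            remaining.append(bin_capacity - lengths[i])
            bins.append([i])
    return bins

# Example with the toy lengths 1..100 and a 512-token bin:
packed = first_fit_decreasing(list(range(1, 101)), 512)
print(len(packed), sum(range(1, 101)) / (len(packed) * 512))  # num bins, utilisation
```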
When performing multipacking, shouldn't the attention mask be adjusted as well? Otherwise, there would be an information leak between two packed examples.
I would say it is currently naive packing with the eos token as a separator. It works in my runs, as far as I can tell. It is also mentioned in this paper:
During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with **multiple documents are not masked in any special way** but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated.
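In code, that kind of naive packing is roughly the following (a sketch of my reading, not the repo's implementation; documents longer than max_len are not handled here):

```python
def naive_pack(examples, eos_token_id, max_len):
    """Concatenate tokenized documents into rows of at most max_len tokens,
    separating them with eos instead of masking attention between them."""
    packed, current = [], []
    for ids in examples:
        doc = ids + [eos_token_id]
        if current and len(current) + len(doc) > max_len:
            packed.append(current)
            current = []
        current += doc
    if current:
        packed.append(current)
    return packed

# naive_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_token_id=0, max_len=8)
# -> [[1, 2, 3, 0, 4, 5, 0], [6, 7, 8, 9, 0]]
```

The attention stays plain causal over the packed row, so later documents can attend to earlier ones; the eos delimiter is what signals the boundary, exactly as in the quoted passage.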
You can also use the torch sampler, though when I did runs comparing the two I did not notice any significant difference when evaluating the model. The difference was the training time.
https://discord.com/channels/1104757954588196865/1104758010959634503/1159194895483941074
This is from the axolotl Discord. They concatenate all the batch inputs into a single 1 x (bs x seqlen) tensor and use the flash-attention varlen scaled dot product with the cumulative sequence lengths of each example in the tensor. In this case the naive approach might be working better than a random one just by chance?
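For illustration, the cu_seqlens bookkeeping looks roughly like this, assuming flash-attn 2's flash_attn_varlen_func (the lengths and head counts below are made up, and this is not axolotl's actual code):

```python
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn >= 2.x, CUDA + fp16/bf16

# Example: three packed sequences of lengths 5, 3 and 8, concatenated into one row.
seq_lens = torch.tensor([5, 3, 8], dtype=torch.int32, device="cuda")
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)          # [0, 5, 8, 16]
total_tokens, n_heads, head_dim = int(seq_lens.sum()), 8, 64

# q, k, v are packed as (total_tokens, n_heads, head_dim) -- no batch dimension.
q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Each sequence only attends within its own [cu_seqlens[i], cu_seqlens[i+1]) span,
# so packed examples cannot leak into each other despite sharing one tensor.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seq_lens.max()), max_seqlen_k=int(seq_lens.max()),
    causal=True,
)
print(out.shape)  # (total_tokens, n_heads, head_dim)
```

Because each sequence only attends within its own span, this keeps the packing efficiency without the cross-document leakage discussed above.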
Here I tested the idea: https://gist.github.com/KeremTurgutlu/847dd84519e28df85e68f8d88dc29905