Multipack Efficiency
KeremTurgutlu opened this issue · 9 comments
Thanks for putting this together!
I am looking into the multipack sampler to get a better understanding of what it is doing. My initial understanding is that it packs sequences so that each bin satisfies total length in batch < bs x seqlen, and the collator later pads to the longest sequence in the batch. I created a toy example to check the unpadded token ratio in each batch, and it turned out to be lower than I expected. I also printed the efficiency() computed in the batch sampler, and it gives a different number.
```python
from dataclasses import dataclass
from typing import Dict, Sequence

import datasets
import numpy as np
import torch
import transformers
from torch.utils.data import DataLoader

# MultipackDistributedBatchSampler comes from
# https://github.com/imoneoi/multipack_sampler (copy the class or adjust the import).


class DummyTokenizer:
    pad_token_id = 0


@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple(
            [instance[key] for instance in instances] for key in ("input_ids", "labels")
        )
        # BEGIN: added lines to convert lists to torch tensors
        input_ids = [torch.tensor(x) for x in input_ids]
        labels = [torch.tensor(x) for x in labels]
        # END
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=-100
        )
        return dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )


# Toy dataset: 100 sequences with lengths 1..100 in random order.
ds = [(torch.ones(x) * x).long() for x in np.random.permutation(np.arange(1, 101))]
ds = [{"input_ids": x, "labels": x} for x in ds]
ds = datasets.Dataset.from_list(ds)
lengths = np.array([len(x["input_ids"]) for x in ds])

train_sampler = MultipackDistributedBatchSampler(
    batch_max_length=4 * 128,
    lengths=lengths,
    num_replicas=1,
    rank=0,
    seed=42,
)
tokenizer = DummyTokenizer()
collator = DataCollatorForSupervisedDataset(tokenizer)
train_loader = DataLoader(
    ds,
    pin_memory=False,
    collate_fn=collator,
    batch_sampler=train_sampler,
)

for b in train_loader:
    # Unpadded token ratio of the collated (padded-to-longest) batch.
    print((b["input_ids"] != tokenizer.pad_token_id).view(-1).float().mean())
    # Packing efficiency reported by the sampler.
    print(train_loader.batch_sampler.efficiency())
```
```
tensor(0.5262)
0.8966619318181818
tensor(0.6364)
0.8966619318181818
tensor(0.4837)
0.8966619318181818
tensor(0.4582)
0.8966619318181818
tensor(0.6306)
0.8966619318181818
tensor(0.7002)
0.8966619318181818
tensor(0.5488)
0.8966619318181818
tensor(0.5200)
0.8966619318181818
tensor(0.4594)
0.8966619318181818
tensor(0.8535)
0.8966619318181818
tensor(0.5618)
0.8966619318181818
```
Maybe I am missing something here. Thanks!
As far as I understand, this packing brings as many data samples into the batch as it can while keeping the total length < bs * max_len. The data collator should pad to the longest sequence in the batch. It could be the case that I have a bug here somewhere. The multipack sampler comes from: https://github.com/imoneoi/multipack_sampler/blob/master/README.md#usage
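For reference, here is a rough sketch of why the two numbers can differ. I am assuming efficiency() is total packed tokens divided by total bin capacity, which is consistent with the 0.8966... ≈ 5050 / (11 * 512) printed above, while the per-batch ratio is measured against the padded-to-longest rectangular tensor the collator produces. The lengths below are made up for illustration:

```python
import numpy as np

# Hypothetical lengths packed into a single bin of capacity 4 * 128 = 512 tokens.
batch_max_length = 4 * 128
seq_lens = np.array([100, 90, 80, 70, 60, 50, 40, 12])

# Sampler-style efficiency (assumed definition): packed tokens / bin capacity.
packing_efficiency = seq_lens.sum() / batch_max_length                  # ~0.98

# Collator-side ratio: pad_sequence pads every row to the longest sequence,
# so the denominator is num_sequences * longest_length, not the bin capacity.
padded_token_ratio = seq_lens.sum() / (len(seq_lens) * seq_lens.max())  # ~0.63

print(packing_efficiency, padded_token_ratio)
```

So a bin can be nearly full by the sampler's measure while the collated rectangular batch still contains a lot of padding whenever the packed lengths are uneven.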
Thanks for the prompt response; that was my understanding of the multipack sampler too. I will also check how it is leveraged in https://github.com/OpenAccess-AI-Collective/axolotl.
Some quick tests for 1 epoch using 1k samples: multipack results in faster training compared to the torch sampler.
multipack sampler:
batch_size = 2
completion: 3 minutes 57 seconds
---
batch_size = 3
completion: OOM
torch sampler:
batch_size = 6
completion: 6 minutes 10 seconds
---
batch_size = 7
completion: OOM
I guess the torch sampler here is just a random sampler. I can take another look at it then. My main concern was the high padded token ratio; maybe I can use the data you have in the repo to compute it again. The ORCA paper actually mentions a different packing algorithm: https://github.com/graphcore/tutorials/tree/sdk-release-2.1/blogs_code/packedBERT. I found it to be slow, which is why I wanted to check this method.
https://github.com/graphcore/tutorials/blob/e9dbe4825f034a47871c4db0deb86d727cbd69b9/blogs_code/packedBERT/nnlshp.py#L51 is the main solver; if it can be sped up, it could be used as well.
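For comparison, a simple first-fit-decreasing heuristic (a sketch written here for illustration, not code from either repo) gives approximate packings in roughly O(n²) as written, which is typically much faster than solving the histogram-packing problem exactly, at the cost of slightly less optimal fills:

```python
def first_fit_decreasing(lengths, bin_capacity):
    """Greedy approximate bin packing: place each sequence (longest first)
    into the first bin with enough remaining capacity, else open a new bin."""
    remaining = []   # remaining capacity per bin
    bins = []        # bin index -> list of sequence indices
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        for b, cap in enumerate(remaining):
            if lengths[i] <= cap:
                remaining[b] -= lengths[i]
                bins[b].append(i)
                break
        else:
            remaining.append(bin_capacity - lengths[i])
            bins.append([i])
    return bins

# Example with the toy lengths 1..100 and a 512-token bin:
packed = first_fit_decreasing(list(range(1, 101)), 512)
print(len(packed), sum(range(1, 101)) / (len(packed) * 512))  # num bins, utilisation
```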
When performing multipacking, shouldn't the attention mask be adjusted as well? Otherwise, there would be an information leak between two packed examples.
I would say it is currently naive packing with the eos token as a separator. It works in my runs, as far as I can tell. It is also mentioned in this paper:
During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with **multiple documents are not masked in any special way** but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated.
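In code, that kind of naive packing is roughly the following (a sketch of my reading, not the repo's implementation; documents longer than max_len are not handled here):

```python
def naive_pack(examples, eos_token_id, max_len):
    """Concatenate tokenized documents into rows of at most max_len tokens,
    separating them with eos instead of masking attention between them."""
    packed, current = [], []
    for ids in examples:
        doc = ids + [eos_token_id]
        if current and len(current) + len(doc) > max_len:
            packed.append(current)
            current = []
        current += doc
    if current:
        packed.append(current)
    return packed

# naive_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], eos_token_id=0, max_len=8)
# -> [[1, 2, 3, 0, 4, 5, 0], [6, 7, 8, 9, 0]]
```

The attention stays plain causal over the packed row, so later documents can attend to earlier ones; the eos delimiter is what signals the boundary, exactly as in the quoted passage.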
You can also use the torch sampler, though when I did runs comparing the two I did not notice any significant difference when evaluating the model. The difference was the training time.
https://discord.com/channels/1104757954588196865/1104758010959634503/1159194895483941074
This is from the axolotl Discord. They concatenate all the batch inputs into a single 1 x (bs x seqlen) tensor and use the flash-attention varlen scaled dot product with the cumulative sequence lengths of each example in the tensor. In this case the naive approach might be working better than a random one just by chance?
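For illustration, the cu_seqlens bookkeeping looks roughly like this, assuming flash-attn 2's flash_attn_varlen_func (the lengths and head counts below are made up, and this is not axolotl's actual code):

```python
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn >= 2.x, CUDA + fp16/bf16

# Example: three packed sequences of lengths 5, 3 and 8, concatenated into one row.
seq_lens = torch.tensor([5, 3, 8], dtype=torch.int32, device="cuda")
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32, device="cuda")
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)          # [0, 5, 8, 16]
total_tokens, n_heads, head_dim = int(seq_lens.sum()), 8, 64

# q, k, v are packed as (total_tokens, n_heads, head_dim) -- no batch dimension.
q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Each sequence only attends within its own [cu_seqlens[i], cu_seqlens[i+1]) span,
# so packed examples cannot leak into each other despite sharing one tensor.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(seq_lens.max()), max_seqlen_k=int(seq_lens.max()),
    causal=True,
)
print(out.shape)  # (total_tokens, n_heads, head_dim)
```

Because each sequence only attends within its own span, this keeps the packing efficiency without the cross-document leakage discussed above.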
Here I tested the idea: https://gist.github.com/KeremTurgutlu/847dd84519e28df85e68f8d88dc29905