huggingface/tokenizers

Gradients in Data Collator lead to Memory Leak

AhmadHAW opened this issue · 1 comments

Hi there, I am training an LLM and a GNN end-to-end on a knowledge graph dataset. In my setup, I produce GNN embeddings for each datapoint inside my custom DataCollator's `_convert_features_into_batches` method, like so:

```python
source_embeddings, target_embeddings = self.get_embeddings_cb(
    self.data, source_ids, target_ids
)
graph_embeddings = torch.stack([source_embeddings, target_embeddings], dim=1)
del source_embeddings, target_embeddings
return {
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "labels": labels,
    "graph_embeddings": graph_embeddings,
}
```

If I don't delete `source_embeddings` and `target_embeddings` manually, the gradients stay in memory, which eventually leads to a memory leak.
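For reference, here is a minimal sketch of an alternative workaround, assuming the embeddings returned by the collator do not need to carry gradients back into the GNN (the method signature and `get_embeddings_cb` follow the snippet above and are otherwise hypothetical). Computing them under `torch.no_grad()` means the batch dict never holds on to an autograd graph in the first place:

```python
import torch

def _convert_features_into_batches(
    self, input_ids, attention_mask, labels, source_ids, target_ids
):
    # Compute the GNN embeddings without recording autograd history, so the
    # returned tensors do not keep the whole computation graph alive between
    # batches.
    with torch.no_grad():
        source_embeddings, target_embeddings = self.get_embeddings_cb(
            self.data, source_ids, target_ids
        )

    # Stack into shape (batch, 2, hidden_dim); detach() is redundant under
    # no_grad(), but makes the intent explicit if no_grad() is ever removed.
    graph_embeddings = torch.stack(
        [source_embeddings, target_embeddings], dim=1
    ).detach()

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "graph_embeddings": graph_embeddings,
    }
```

Of course, this only applies if the GNN is not supposed to be trained through the collator output; for true end-to-end training the embeddings have to be produced inside the training graph, as mentioned below.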

ChatGPT says this kind of behaviour is generally known, and I understand if this is not really a concern for you. I also know it makes more sense to produce the graph embeddings inside the model's forward function, and I will implement it that way in the future (see the sketch below). I just wanted to let you know that this may be an issue otherwise and can lead to unexpected behaviour.
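Roughly, by "inside the model's forward function" I mean something like the following sketch, where a wrapper module owns both the GNN and the LLM (the names `GraphLLM`, `self.gnn`, and `self.llm` are made up for illustration):

```python
import torch
import torch.nn as nn

class GraphLLM(nn.Module):
    """Hypothetical wrapper that owns both the GNN and the LLM."""

    def __init__(self, gnn, llm):
        super().__init__()
        self.gnn = gnn
        self.llm = llm

    def forward(self, input_ids, attention_mask, labels,
                source_ids, target_ids, graph_data):
        # The embeddings are produced inside forward, so their autograd graph
        # only lives for the current training step and is freed after backward().
        source_embeddings = self.gnn(graph_data, source_ids)
        target_embeddings = self.gnn(graph_data, target_ids)
        graph_embeddings = torch.stack(
            [source_embeddings, target_embeddings], dim=1
        )

        return self.llm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
            graph_embeddings=graph_embeddings,
        )
```

With this split, the collator only has to return plain id tensors (`source_ids`, `target_ids`), so nothing in the batch carries gradients at all.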

Greetings, Ahmad

I will put this in the correct channel, sorry!!