huggingface/peft

Training without the Trainer class (VRAM usage issue)

venzino-han opened this issue · 4 comments

System Info

Python 3.10.12
transformers 4.40.0.dev0
peft 0.10.1.dev0
torch 2.1.2

I tried to apply LoRA with PEFT, but the model still uses the same amount of GPU VRAM.

How can I train the model without using the transformers.Trainer class?

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

import torch
from tqdm import tqdm
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, TaskType, get_peft_model

        # Inside a custom trainer class: wrap the base model with LoRA adapters
        model = T5ForConditionalGeneration.from_pretrained(args.model_name)
        peft_config = LoraConfig(
            peft_type="LORA",
            task_type=TaskType.SEQ_2_SEQ_LM, 
            r=4, 
            lora_alpha=32,
            target_modules=["q","v"],
            lora_dropout=0.1
        )
        model = get_peft_model(model, peft_config)
        model.print_trainable_parameters()


        # AdamW over all parameters; frozen base weights (requires_grad=False)
        # never receive gradients, so no optimizer state is allocated for them
        self.optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=args.learning_rate,
            weight_decay=args.weight_decay, 
            eps=args.adam_epsilon
        )



    def train_epoch(self, dataloader, epoch):
        epoch_loss = 0
        # self.model.train()
        # self.model = self.model.to(self.device)
        accumulated_embeddings = None
        accumulated_scores = None
        count = 0
        for batch in tqdm(dataloader):
            count += 1
            self.optimizer.zero_grad()
            self.model.zero_grad()

            # Extract and send batch data to the specified device
            source_ids = batch["source_ids"].to(self.device)
            attention_mask = batch["source_mask"].to(self.device)
            decoder_attention_mask = batch["target_mask"].to(self.device)
            target_ids = batch["target_ids"].to(self.device)
            scores = batch["y"].to(self.device)
            pids = batch["prompt_id"].to(self.device)
            scores = scores.to(torch.bfloat16)

            # Forward pass and calculate loss
            outputs = self.model(
                input_ids=source_ids,
                attention_mask=attention_mask,
                decoder_attention_mask=decoder_attention_mask,
                labels=target_ids,
                return_dict=True,
            )

            loss = outputs.loss

            loss.backward()
            torch.nn.utils.clip_grad_norm_(
                self.model.parameters(), self.args.max_grad_norm
            )
            self.optimizer.step()  # Update model parameters
            epoch_loss += loss.item()

        # Get the current learning rate from scheduler or optimizer
        lr = (
            self.scheduler.get_last_lr()[0]
            if self.scheduler
            else self.optimizer.param_groups[0]["lr"]
        )

        log = f"epoch: {epoch}  | "
        log += f"train loss: {epoch_loss/len(dataloader):.6f} | "
        log += f"lr: {lr:.6f} |"

        if self.scheduler:
            self.scheduler.step()  # Update learning rate

        return epoch_loss / len(dataloader)

Expected behavior

I want to run PEFT without the Trainer and still get the VRAM savings expected from LoRA.

How did you measure the VRAM usage? What values do you get when you run with vs without PEFT? As your code is not complete, I cannot try to replicate your issue.

I measured the VRAM usage with wandb, but the usage before and after adopting LoRA was similar.
I just want to know whether PEFT works the same when I don't use the Trainer from transformers.
@BenjaminBossan Thanks for your kind reply.

Yes, PEFT can definitely work without Trainer; at first glance, the code you posted looks correct.
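
As a sanity check (a minimal sketch with a placeholder checkpoint name and learning rate, not your actual values), you can verify that only the LoRA adapters are trainable and, if you like, pass only those parameters to the optimizer:

    import torch
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("t5-small")  # placeholder checkpoint
    peft_config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=4,
        lora_alpha=32,
        target_modules=["q", "v"],
        lora_dropout=0.1,
    )
    model = get_peft_model(model, peft_config)

    # Only the LoRA matrices should require gradients
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    assert all("lora" in n for n in trainable)
    model.print_trainable_parameters()

    # Optional: build the optimizer only from trainable parameters, so it is
    # explicit that no optimizer state exists for the frozen base weights
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )

If the trainable parameter count printed there is a small fraction of the total, LoRA is applied correctly, and the remaining memory is dominated by the frozen weights and the activations.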

When it comes to the memory savings of LoRA, it depends on many factors: model size, sequence length, choice of optimizer, LoRA config settings, etc. It has also happened in the past that PyTorch would reserve more memory than it actually needed when using LoRA, so also keep an eye on reserved vs. allocated memory.
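
If you want to check this directly instead of relying on wandb's system metrics, a quick way (a rough sketch, assuming a single CUDA device) is to print both numbers right after a training step:

    import torch

    def report_cuda_memory(tag=""):
        # Allocated: memory actually occupied by live tensors.
        # Reserved: memory held by PyTorch's caching allocator, which can be
        # noticeably larger than the allocated amount.
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"{tag} allocated: {allocated:.2f} GiB | reserved: {reserved:.2f} GiB")

    # e.g. inside the training loop, after optimizer.step():
    # report_cuda_memory(tag="after step")

Comparing the allocated number with and without the LoRA adapters gives a cleaner picture than the total GPU memory shown by monitoring tools, which typically report reserved (or whole-process) memory.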

@BenjaminBossan Thanks for your kind comment!