
LoRA Experiment

Implementation of LoRA: Low-Rank Adaptation of Large Language Models.

This project is part of the TF06 course from ProtonX. We use the LoRA technique to train large language models more efficiently.

We fine-tune Bloomz-1b1 on English-Vietnamese datasets.

Give us a star if this repo is helpful to you.

Slides explaining LoRA (by Nguyen Bui Ngoc Han):

I. How to run our pretrained model?

Just download the .ipynb file and run it on Google Colab or in your own Jupyter Notebook.


Live demo (click the icon below to run it in Colab):

II. How to add LoRA when fine-tuning your own model?

  • Step 1: Load your model.

    For example, suppose you load a model like this:

    from transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer
    modelName = "bigscience/bloomz-1b1" # or any other causal LM on the Hugging Face Hub
    model = AutoModelForCausalLM.from_pretrained(modelName).to(device)
    tokenizer = AutoTokenizer.from_pretrained(modelName)

    Here device is the hardware you run on (GPU or CPU). You can set it automatically with this code:

    import torch
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
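
    Optionally, you can check how large the base model is before adding LoRA (bloomz-1b1 has roughly 1.1B parameters):

    # Count the parameters of the base model, for comparison with the LoRA adapter added later
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Base model parameters: {n_params:,}")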
  • Step 2: Prepare the dataset for training.

    For example, if you want to build a text-generation model for a question-answering task, you will need a dataset containing a list of questions and answers. You can try this dataset for practice:

    Get dataset from source:

      !wget https://raw.githubusercontent.com/phatjkk/data/main/LLM/Ecommerce_FAQ_Chatbot_dataset.json
    

    Load dataset as HuggingFace Dataset type:

      from datasets import load_dataset
      from datasets import Dataset
      # The JSON file keeps all QA pairs under a single "questions" key,
      # so unpack that list into a flat Dataset of question/answer rows
      data = load_dataset('json', data_files='Ecommerce_FAQ_Chatbot_dataset.json')
      ds = Dataset.from_list(data["train"]["questions"][0])
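
    To sanity-check what was loaded, you can inspect the dataset object and its first example (the question and answer fields printed here are the ones used in the next step):

      # Quick look at the dataset: row count, column names, and the first QA pair
      print(ds)
      print(ds[0])  # expected to contain "question" and "answer" keys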

    Merge the question and answer columns into a single column called prediction:

      def merge_columns(example):
          example["prediction"] = example["question"] + " ->: " + str(example["answer"])
          return example
      # Map merge_columns function to dataset
      ds = ds.map(merge_columns)
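
    You can spot-check the merged column; each row should now be the question and the answer joined by the " ->: " separator:

      # Inspect the first merged example: "<question> ->: <answer>"
      print(ds[0]["prediction"])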

    Tokenize the prediction column:

      # Tokenize / vectorize the text (convert the text into numbers for training)
      def tokeni(example):
          example["prediction_token"] = tokenizer(example["prediction"], return_tensors='pt', padding=True)['input_ids']
          return example
      # Map tokeni function to dataset
      ds = ds.map(tokeni,batched=True)
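
    If you later want to feed the Trainer a dataset with standard input_ids and attention_mask columns instead of the raw token column above, a minimal alternative tokenization could look like this (tokenize_fn, ds_tokenized, and the max_length value are illustrative, not part of the original notebook):

      # Alternative: tokenize into the columns Hugging Face data collators expect
      def tokenize_fn(examples):
          return tokenizer(examples["prediction"], truncation=True, max_length=256)
      ds_tokenized = ds.map(tokenize_fn, batched=True, remove_columns=ds.column_names)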
  • Step 3: Add a LoRA adapter (LoraConfig) to the model

      # Set the LoRA config
      from peft import LoraConfig, get_peft_model
      config = LoraConfig(
          r=16, # LoRA rank (the dimension of the low-rank update matrices)
          lora_alpha=16, # alpha scaling factor
          lora_dropout=0.05,
          bias="none",
          task_type="CAUSAL_LM" # set this for CLM or Seq2Seq
      )
      # Attach the PEFT (LoRA) adapter to the model
      model_lora = get_peft_model(model, config)

    Here is a brief explanation of the arguments (a quick parameter-count check follows the list):

    • r: The LoRA attention dimension (rank) (int).
    • lora_alpha: The alpha parameter for LoRA scaling.
    • lora_dropout: The dropout probability for the LoRA layers.
    • bias: Bias type for LoRA. Can be 'none', 'all' or 'lora_only'.
    • task_type: The task you want to run (here CAUSAL_LM).
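
    To see how few weights the adapter actually trains compared with the full model, you can print the trainable parameter count (print_trainable_parameters is a helper on the PEFT-wrapped model; the exact numbers depend on the base model):

      # Show trainable vs. total parameters after attaching the adapter
      model_lora.print_trainable_parameters()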
  • Step 4: Train the model

      # Training model
      import transformers
      from transformers import Trainer, EarlyStoppingCallback

      # Split the dataset into train and test sets (here 10% is held out for evaluation)
      ds_tt = ds.train_test_split(test_size=0.1)

      class CustomTrainer(Trainer):
          def compute_loss(self, model, inputs, return_outputs=False):
              outputs = model(**inputs)
              # Use perplexity (the exponential of the language-modeling loss) as the training objective
              perplexity = torch.exp(outputs.loss)
              return (perplexity, outputs) if return_outputs else perplexity

      trainer = CustomTrainer(
          model=model_lora, # train the LoRA-wrapped model from Step 3
          train_dataset=ds_tt["train"]["prediction_token"],
          eval_dataset=ds_tt["test"]["prediction_token"],
          args=transformers.TrainingArguments(
              per_device_train_batch_size=3, # batch size
              num_train_epochs=1, # epochs
              gradient_accumulation_steps=1,
              warmup_steps=100,
              save_total_limit=5,
              learning_rate=2e-4,
              fp16=True,
              output_dir='outputs',
              logging_steps=500,
              evaluation_strategy="steps",
              load_best_model_at_end=True
          ),
          data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
          callbacks=[EarlyStoppingCallback(early_stopping_patience=4)]
      )
      model.config.use_cache = False  # silence the warnings; re-enable for inference!
      trainer.train()
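
    After training, you can also save just the LoRA adapter weights (typically only a few megabytes) and re-attach them to the base model later; a minimal sketch, where "outputs/lora-adapter" is an illustrative path:

      # Save only the adapter weights produced by LoRA
      model_lora.save_pretrained("outputs/lora-adapter")

      # Later: reload the base model and attach the saved adapter for inference
      from peft import PeftModel
      base_model = AutoModelForCausalLM.from_pretrained(modelName).to(device)
      model_lora = PeftModel.from_pretrained(base_model, "outputs/lora-adapter")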

    When the training task finishes, you can plot the training and validation loss curves:

      # Separate the training and validation losses from the trainer's log history
      trainingEpoch_loss_adam, validationEpoch_loss_adam = [], []
      for entry in trainer.state.log_history[:-1]:
          if "loss" in entry:
              trainingEpoch_loss_adam.append(entry["loss"])
          elif "eval_loss" in entry:
              validationEpoch_loss_adam.append(entry["eval_loss"])

      from matplotlib import pyplot as plt
      plt.plot(trainingEpoch_loss_adam, label='train_loss')
      plt.plot(validationEpoch_loss_adam, label='val_loss')
      plt.legend()
      plt.show()

    Example result:

  • Step 5: Test the generation task

    You can generate text from the model like this:

      question = "How can I create an account?"
      prompt = question + " ->: "
      inputs = tokenizer(prompt, return_tensors="pt")
      with torch.autocast(device.type):
          outputs = model_lora.generate(input_ids=inputs["input_ids"].to(device), max_new_tokens=100)
          print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])

    Example result:

      How can I create an account? ->:  Click the "Create an account" button. Enter your email address and password. Click the "Continue" button.
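
    If you want to run inference without the PEFT wrapper, you can optionally merge the adapter into the base weights (merge_and_unload is a PeftModel helper; the output path below is just illustrative):

      # Fold the LoRA weights into the base model so it behaves like a plain transformers model
      merged_model = model_lora.merge_and_unload()
      merged_model.save_pretrained("outputs/bloomz-1b1-merged")
      tokenizer.save_pretrained("outputs/bloomz-1b1-merged")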

III. About datasets

In this project, we use datasets from three sources:

IV. Results and Comparison

Model results:

- NLLB + viquad Dataset (Vietnamese): (training_loss=2.1773)
- Ecommerce FAQ Chatbot Dataset (English): (training_loss=2.3110)
- Ecommerce FAQ Chatbot Dataset (Vietnamese): (training_loss=2.0299)

Training time comparison:

  • Model bloomz-1b1 trained on the NLLB data for 1 epoch (with LoRA), on a Colab V100

  • Model bloomz-1b1 trained on the NLLB data for 1 epoch (without LoRA), on a Colab V100

Comparison table:

                  LoRA      Without LoRA
  Training time   ~157m     ~202m

So with the LoRA technique, we reduced training time by about 22.2% on the NLLB-57k dataset with the bloomz-1b1 model.

Authors:

Nguyen Thanh Phat (phatjk)

Nguyen Bui Ngoc Han (Nguyễn Hân)

Nguyen Thanh Chung (Edward Nguyen)

Pham Quynh Trang (Trang Pham)

Advisors:

Nguyen Ba Ngoc