abhinand5/tamil-llama

Questions: Regarding training and fine-tuning

Closed this issue · 15 comments

What is the preferred strategy for fine-tuning: resuming training from the pre-trained adapters (obtained during pretraining) or creating a new adapter?

Hi @kdcyberdude,

When fine-tuning, the choice between starting from an "Instruct" model or the base (pretrained) model depends on the specific goals and requirements of the fine-tuning process.

  1. Fine-tuning from the Instruct Model: This approach is suitable if your goal is to enhance the model's ability to follow instructions more effectively. If you are focusing on tasks like reasoning, coding, or incorporating custom knowledge within the realms of Tamil/English language processing, fine-tuning from an instruct model would be appropriate. This is because the instruct model is already optimized for understanding and following complex instructions.

  2. Fine-tuning from the Base Model: This approach is more suitable for tasks that require significant adaptation of the model. Examples include:

    • Adding Support for Another Language: If you're introducing a completely new language, starting from the base model allows for more foundational changes in language understanding and generation.
    • Improving Abilities in Specialized Domains like Medicine or Business: For domain-specific adaptations, the base model offers a more neutral starting point, allowing for deeper and more focused learning in that particular field.

Each approach has its advantages depending on the final objectives of the fine-tuning.

Hi @abhinand5,

I'm curious to understand your fine-tuning approach when integrating a new language. Once you've pre-trained the Tamil-llama model, you obtain LoRA adapter weights. Were these LoRA weights directly used for further fine-tuning on instructional data, or did you first integrate them with the base llama model before conducting LoRA fine-tuning on the merged model?
In simpler terms, can you clarify whether you fine-tuned the existing LoRA weights or generated entirely new ones atop the pre-trained model? Is the resulting model structured as Base model + LoRA(Pretrain) + LoRA(Instruction Tune), or does it follow a different composition, such as Base model + LoRA(Pretrain and Instruction Tune combined)?

You can't directly take the LoRA adapters and fine-tune them. The pretrained LoRA adapters are merged with the base model; after merging, the resulting pretrained model has the learned weights necessary to represent the new language. That merged pretrained model can then be fine-tuned...

Here are the steps involved:

  1. Start with Pre-training: First, you pre-train the model with LoRA adapters to adapt it to a new language. These LoRA adapters obtained as a result are like special adjustments that help the model understand the new language better.

  2. Merge the Adapters with the Model: After pre-training, you take these LoRA adapters and combine them with the base model (see the merge sketch after this list). Now, the model not only has its original capabilities but also the new skills it learned for the new language.

  3. Fine-tune for Specific Tasks: Next, you fine-tune this updated model for specific tasks, like understanding instructions. During this fine-tuning, you're making further adjustments, still using LoRA to save on compute, to make the model even better at these specific tasks.

  4. End Up with a Fully Adapted Model: Merge the LoRA weights with the pretrained model from (2). In the end, what you get is a model that's been both pre-trained and fine-tuned with LoRA. It's now good at the new language and also at the specific tasks you trained it for.
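For concreteness, here is a minimal sketch of the merge step (2) using PEFT. The base model name, tokenizer path, and adapter path are hypothetical placeholders, not the exact Tamil-LLaMA artifacts.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and the extended tokenizer produced during vocabulary extension.
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("path/to/extended-tokenizer")  # hypothetical path
base_model.resize_token_embeddings(len(tokenizer))

# Load the pre-training LoRA adapters and fold them into the base weights.
model = PeftModel.from_pretrained(base_model, "path/to/pretrain-lora-adapters")  # hypothetical path
merged_model = model.merge_and_unload()

# Save the merged model; this becomes the starting point for instruction tuning.
merged_model.save_pretrained("path/to/tamil-pretrained-merged")
tokenizer.save_pretrained("path/to/tamil-pretrained-merged")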

Thank you, @abhinand5, for clarifying and providing such a good explanation.

[You can't directly take the LoRA adapters and fine-tune them]

Regarding directly fine-tuning LoRA adapters: what do you think about resuming from the checkpoint saved after pretraining?

training_arguments = TrainingArguments(
    resume_from_checkpoint='./results/checkpoint-400/',
)

# Pass the checkpoint path explicitly so the Trainer resumes from that exact checkpoint
# (resume_from_checkpoint=True only looks for the latest checkpoint in output_dir).
trainer.train(resume_from_checkpoint=training_arguments.resume_from_checkpoint)

Here, './results/checkpoint-400/' is the checkpoint path of the LoRA adapters.

Hi @abhinand5, did you use the text corpus created by generate_text_corpus.py for training? I have created a separate txt file for each row of the dataframe, and it's taking too much time to tokenize!

Hi @kdcyberdude, for the first question on resuming from the LoRA pre-training checkpoint: I don't think it is necessary, and I've never actually seen anyone do that. Starting a new LoRA training on the full (merged) pretrained model would be more beneficial.
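For reference, a minimal sketch of what a fresh LoRA run on the merged pretrained model could look like with peft/transformers; the model path, target modules, and hyperparameters below are illustrative assumptions, not the exact Tamil-LLaMA settings.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load the merged (base + pre-training LoRA) model; the path is hypothetical.
model = AutoModelForCausalLM.from_pretrained("path/to/tamil-pretrained-merged")
tokenizer = AutoTokenizer.from_pretrained("path/to/tamil-pretrained-merged")

# Attach brand-new LoRA adapters for instruction tuning; values are illustrative.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the new adapters are trainable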

The SentencePiece trainer expects a txt file with one sentence per line, so you essentially need to provide a sentence corpus; that script also assumes a sentence corpus. More details can be found in their docs -> https://github.com/google/sentencepiece?tab=readme-ov-file#usage-instructions
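For reference, here is a minimal sketch of training a SentencePiece model on such a one-sentence-per-line corpus; the file names, vocab size, and character coverage below are assumptions, not the exact Tamil-LLaMA settings.

import sentencepiece as spm

# corpus.txt: one sentence per line, as expected by the SentencePiece trainer.
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # hypothetical sentence corpus
    model_prefix="tamil_sp",      # writes tamil_sp.model and tamil_sp.vocab
    vocab_size=16000,             # illustrative value
    model_type="unigram",         # the default SentencePiece algorithm
    character_coverage=0.9995,    # useful for scripts with large character sets
)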

What about dataset_dir in the run_pt.sh script? This directory expects *.txt files. Should we create a separate txt file for each document in our dataset (600K in your case), or should we use the output of generate_text_corpus.py (a single file containing all 600K docs) for training the Llama model in run_pt.sh?

@abhinand5, is there any specific reason for using a unigram tokenizer instead of BPE? The base Llama tokenizer is a BPE model:

[Screenshot: base Llama tokenizer config showing a BPE model]

@kdcyberdude Not a single file; it should be a folder containing multiple text files. generate_text_corpus.py is meant for the SP (SentencePiece) trainer. To convert a Hugging Face dataset into text format, you can write a function like this:

import os

from tqdm import tqdm


def save_text_to_files_hf(dataset, text_column, num_rows_per_txt, output_directory, suffix=None):
    """
    Save text data from a HuggingFace Dataset into text files.

    Args:
    dataset (datasets.Dataset): The Dataset containing the text data.
    text_column (str): The name of the column containing the text data.
    num_rows_per_txt (int): The number of rows to store in each text file.
    output_directory (str): The directory where text files will be saved.
    suffix (str, optional): Optional tag inserted into each output file name.
    """

    # Create the output directory if it doesn't exist
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)

    # Get the total number of rows in the Dataset
    total_rows = len(dataset)

    # Calculate the number of text files needed
    num_text_files = (total_rows + num_rows_per_txt - 1) // num_rows_per_txt

    for i in tqdm(range(num_text_files), total=num_text_files):
        # Calculate the start and end indices for the current chunk of rows
        start_idx = i * num_rows_per_txt
        end_idx = min((i + 1) * num_rows_per_txt, total_rows)

        # Extract the text data for the current chunk of rows
        chunk_data = dataset.select(range(start_idx, end_idx))[text_column]

        if suffix is None:
            fname = f"data_{i + 1:0{4}}.txt"
        else:
            fname = f"data_{suffix}_{i + 1:0{4}}.txt"

        # Create a text file and write the chunk data to it
        file_path = os.path.join(output_directory, fname)
        with open(file_path, 'w', encoding='utf-8') as file:
            for text in chunk_data:
                file.write(text + '\n')
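For example, the function above could be used like this (the dataset name and chunk size are hypothetical):

from datasets import load_dataset

# Hypothetical dataset and parameters; adjust to your own corpus.
dataset = load_dataset("your_org/tamil_corpus", split="train")
save_text_to_files_hf(
    dataset,
    text_column="text",
    num_rows_per_txt=1000,        # roughly 600 files for a 600K-row dataset
    output_directory="./pt_data",
    suffix="tamil",
)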

Regarding the tokenizer, I experimented with both BPE and unigram tokenizers during the development of Tamil-LLaMA. After manual examination of the tokenization outcomes, I found that the default SP (unigram) tokenizer provided superior results. It's important to note that the dataset used for training can influence the effectiveness of different tokenization methods. However, I anticipated that the choice of tokenizer would not significantly impact the performance of the pretrained model. This expectation held true, as the pretrained model with the extended SentencePiece vocabulary demonstrated consistent and cohesive text generation.

Moreover, it's worth mentioning that employing a combination of sub-word tokenization algorithms in Large Language Models (LLMs) is not unusual; this approach is also adopted for languages like Chinese. While it might be argued that the Unigram algorithm is better suited to non-space-separated languages like Chinese, and is therefore unnecessary for Tamil, my manual inspections suggested the BPE model's outcomes were slightly worse, and I could tell this easily since I read the language fluently. Given that this choice did not compromise the model's ability to generate cohesive text in both English and Tamil, I decided to proceed with the SentencePiece tokenizer.
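As a rough illustration of how such a side-by-side comparison can be done (the file names, vocab size, and sample sentence are assumptions, not the actual Tamil-LLaMA setup):

import sentencepiece as spm

# Train a small unigram and a small BPE tokenizer on the same sentence corpus.
for algo in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",            # one sentence per line
        model_prefix=f"tamil_{algo}",
        vocab_size=16000,              # illustrative value
        model_type=algo,
    )

# Inspect how each algorithm segments the same sample sentence.
sample = "ஒரு எடுத்துக்காட்டு தமிழ் வாக்கியம்."
for algo in ("unigram", "bpe"):
    sp = spm.SentencePieceProcessor(model_file=f"tamil_{algo}.model")
    print(algo, sp.encode(sample, out_type=str))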


That said, I am working on an improved version of Indic LLMs, similar to Tamil LLaMA, where the tokenizer is much improved and very similar in style to the original Llama tokenizer.

[Not a single file; it should be a folder containing multiple text files. generate_text_corpus.py is meant for the SP trainer...]

I see, I've been following a similar approach, where I generate individual .txt files for each row (document) from the Hugging Face dataset. However, when I applied this method to 50K documents, I noticed that the tokenizing and grouping processes were taking a considerable amount of time. That's why I wanted to reach out and confirm whether my approach aligns with the expected efficiency.

Out of curiosity, could you share an estimate of the time it took to tokenize and group the 600K Tamil documents (roughly, per row)?

Hey @abhinand5, steering away from our project for a moment: have you had a look at the first phase of training for the OpenHathi (Hindi Llama) model by Sarvam.ai? (https://www.sarvam.ai/blog/announcing-openhathi-series) I came across their approach, and while I'm not entirely certain, it looks like they might be using supervised fine-tuning (SFT Trainer) for translation in the first phase, as opposed to the more conventional next-word prediction, which they do in the second phase. What are your thoughts on this? It seems like an interesting departure from the usual methods, and I'm curious to hear your perspective.

And I really appreciate you sparing some time to answer these questions. Thank you :)

@kdcyberdude For those preprocessing steps, I don't remember them taking much time for me, certainly not more than 10 minutes. I was using 16 vCPUs, by the way.
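For what it's worth, here is a sketch of how the tokenization and grouping step is typically parallelized with datasets.map; the tokenizer path, num_proc, and block size are illustrative, and this is not the exact code from run_pt.sh.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/extended-tokenizer")  # hypothetical path
raw = load_dataset("text", data_files={"train": "pt_data/*.txt"})["train"]

def tokenize_fn(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

# num_proc spreads the work across CPU cores; 16 matches the vCPU count mentioned above.
tokenized = raw.map(tokenize_fn, batched=True, num_proc=16, remove_columns=["text"])

def group_texts(examples, block_size=512):
    # Concatenate all sequences, then split into fixed-length blocks (standard CLM preprocessing).
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i:i + block_size] for i in range(0, total_len, block_size)]
        for k, t in concatenated.items()
    }

grouped = tokenized.map(group_texts, batched=True, num_proc=16)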


Regarding OpenHathi, it is an interesting approach for sure. I don't fully understand Phase 1 (translation alignment) either; as you said, it could be just SFT where only the embedding layers were trained.

Phase 2 is just continued pretraining, but with sentences alternating between Hindi and English.

From what we have seen with the Chinese models, I don't think such a complicated approach is really necessary for bilingual language understanding and generation.

For example, take a look at some of the results from my experiments (I can't read Hindi, so I'm relying on Google Translate to understand the outputs):

Example 1:
[Screenshot: model output for Example 1]

Example 2:
[Screenshot: model output for Example 2]

  • Response 1 seems good, but because of the nature of the training, it alternates sentences between Hindi and English.
  • A slightly different way of prompting and, boom, we see some wild hallucinations. (I've only changed the language of the prompt here, by the way.)
  • I think this approach might increase hallucination. I might be horribly wrong here! More experiments comparing results from different approaches are needed to arrive at any sort of conclusion.
  • Nonetheless, they have done an excellent job of creating the first Hindi LLM, which is very promising.

I believe their adoption of this intricate approach aims to preserve the quality of both English and Romanized Hindi content, recognizing that existing English tokens in each language (both English and Romanized Hindi) may convey distinct meanings. The overarching objective seems to be aligning the embeddings and adapters with English weights, ensuring a seamless transfer of English knowledge.

Having watched the full 20-minute demo video of this yet-to-be-released fine-tuned model, it's evident that, following phase 3 fine-tuning, the alternation between English and Hindi sentences seen in the phase-2 fine-tuned model is addressed through supervised training.

From my perspective, introducing an equivalent proportion of raw Hindi data, either in phase 2 within the alternating-sentence dataset or potentially in phase 3 (though I'm unsure which would be more effective), could prove advantageous.

I'm intrigued – do you happen to know the impact on English quality in the context of Tamil-Llama or Chinese-Llama?

Hi @abhinand5,
Could you share the cost of the Google Translate API for creating an instruction-tuning dataset?