PacktPublishing/LLM-Engineers-Handbook

BUG: Hard-coded configs cause training and evaluation pipelines to fail


Training Pipeline

~/LLM-Engineers-Handbook/llm_engineering/model/finetuning/finetune.py has a hard-coded config on line 153:

dataset = load_dataset(f"{dataset_huggingface_workspace}/llmtwin-dpo", split="train")
if is_dummy:
    dataset = dataset.select(range(400)) # this is the hard-coded line

This causes the training pipeline to fail with `finetuning_type = dpo`, because the preference training dataset only has 113 samples with the current default configs.
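
A minimal sketch of one way to avoid the crash: cap the dummy subset at the dataset length and expose the size as a parameter instead of hard-coding 400. The function and parameter names (load_train_split, dummy_size) are illustrative, not the repo's actual API.

from datasets import load_dataset

def load_train_split(dataset_id: str, is_dummy: bool, dummy_size: int = 400):
    dataset = load_dataset(dataset_id, split="train")
    if is_dummy:
        # never select more rows than the split contains
        # (the preference split only has 113 rows with the default configs)
        dataset = dataset.select(range(min(dummy_size, len(dataset))))
    return dataset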

Evaluation Pipeline

~/LLM-Engineers-Handbook/llm_engineering/model/evaluation/evaluate.py has a hard-coded config on line 202:

model_ids = [
    check_if_huggingface_model_exists(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/TwinLlama-3.1-8B", default_value="mlabonne/TwinLlama-3.1-8B"
    ),
    check_if_huggingface_model_exists(
        f"{MODEL_HUGGINGFACE_WORKSPACE}/TwinLlama-3.1-8B-DPO", default_value="mlabonne/TwinLlama-3.1-8B-DPO"
    ),
    "meta-llama/Meta-Llama-3.1-8B-Instruct", # this is the hard-coded line
]

It appears Meta may have renamed "meta-llama/Meta-Llama-3.1-8B-Instruct" on Hugging Face. This model id is also inconsistent with the base model used for SFT fine-tuning on line 271 of ~/LLM-Engineers-Handbook/llm_engineering/model/finetuning/finetune.py:

if args.finetuning_type == "sft":
    print("Starting SFT training...")  # noqa
    base_model_name = "meta-llama/Meta-Llama-3.1-8B"  # this is the hard-coded line

Currently, the training pipeline succeeds with `finetuning_type = sft`, but the evaluation pipeline fails when attempting to access "meta-llama/Meta-Llama-3.1-8B-Instruct". Some of these configs should probably be exposed in the YAML or .env files to keep the model ids consistent and make it easier to update naming conventions.
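
As a rough illustration of that proposal (field names are my own, and the defaults just mirror the ids quoted above), a small .env-backed settings class could hold both Llama ids and the dummy dataset size in one place, so finetune.py and evaluate.py cannot drift apart. I am using pydantic-settings here purely as an example; plain os.getenv would work too.

from pydantic_settings import BaseSettings, SettingsConfigDict

class ModelSettings(BaseSettings):
    # values can be overridden from .env or the environment
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    base_model_id: str = "meta-llama/Meta-Llama-3.1-8B"
    instruct_model_id: str = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    dummy_dataset_size: int = 400

settings = ModelSettings()  # e.g. settings.instruct_model_id in evaluate.py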

Hello @bgereke ,

I agree with you that it would have been better to add these as configs. The code is far from perfect, but sometimes you have to make trade-offs because of a lack of time.

For now, I updated the Llama model ids and extracted the dummy dataset size as a parameter.
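
For illustration only (this is not necessarily how it is done in the repo), such a parameter could be threaded through the CLI args that finetune.py already parses:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--finetuning_type", choices=["sft", "dpo"], default="sft")
parser.add_argument("--is_dummy", action="store_true")
parser.add_argument("--dummy_dataset_size", type=int, default=400)
args = parser.parse_args()

# later, when loading the preference dataset:
# if args.is_dummy:
#     dataset = dataset.select(range(min(args.dummy_dataset_size, len(dataset))))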

I am looking forward to a contribution from you if you want to implement what you proposed!